<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Programme to uncompress the archives

## Prerequisites

### Environment variables

AWS credentials and other parameters should be stored in the `.env` file.

In [None]:
AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
REGION_NAME=<YOUR_REGION_NAME>
SOURCE_BUCKET_NAME=<YOUR_SOURCE_BUCKET_NAME>
DESTINATION_BUCKET_NAME=<YOUR_DESTINATION_BUCKET_NAME>
INPUT_DIRECTORY=<YOUR_INPUT_DIRECTORY>
OUTPUT_DIRECTORY=<YOUR_OUTPUT_DIRECTORY>
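
Since `unarchive.py` reads these values with `os.environ[...]`, a missing entry only surfaces as a `KeyError` at the point of use. A minimal sketch of an up-front check, assuming the variable names listed above; the helper name `missing_env_vars` is hypothetical and not part of `unarchive.py`:

```python
import os

# Names taken from the `.env` template above.
REQUIRED_VARS = [
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY',
    'REGION_NAME',
    'SOURCE_BUCKET_NAME',
    'DESTINATION_BUCKET_NAME',
    'INPUT_DIRECTORY',
    'OUTPUT_DIRECTORY',
]


def missing_env_vars(env=None):
    """Return the required variable names that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == '__main__':
    missing = missing_env_vars()
    if missing:
        print('Missing settings in .env:', ', '.join(missing))
    else:
        print('All required settings are present.')
```

Calling `missing_env_vars()` right after `load_dotenv()` would report every missing setting at once instead of failing on the first one.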

### Required libraries

The required libraries are listed in the `unarchive.req` file.
- Create the environment `my_env`: `python3 -m venv my_env`
- Activate `my_env`: `source my_env/bin/activate`
- Install the libraries: `pip install -r unarchive.req`

#### Contents of `unarchive.req`

In [None]:
python-dotenv
boto3
pandas

### Execution in the background

In [None]:
nohup python -u unarchive.py &

## Code of `unarchive.py`

In [None]:
# Edit the file '.env' and provide the required parameters
# Install the required libraries in the environment by executing: 'pip install -r unarchive.req'

# Importing the required libraries
from dotenv import load_dotenv
import boto3
import pandas as pd
import tarfile
import bz2
import os
import sys
import shutil
import datetime

load_dotenv()  # This line brings all environment variables from '.env' into 'os.environ'

# Define the name of the CSV file containing the list of S3 keys
key_list = 'unarchive_key_list_test.csv'
#key_list = 'unarchive_key_list_2011.csv'
#key_list = 'unarchive_key_list_2012.csv'
#key_list = 'unarchive_key_list_2013.csv'
#key_list = 'unarchive_key_list_2014.csv'
#key_list = 'unarchive_key_list_2015.csv'
#key_list = 'unarchive_key_list_2016.csv'
#key_list = 'unarchive_key_list_2017.csv'
#key_list = 'unarchive_key_list_2018.csv'
#key_list = 'unarchive_key_list_2019.csv'
#key_list = 'unarchive_key_list_2020.csv'
#key_list = 'unarchive_key_list_2021.csv'
#key_list = 'unarchive_key_list_2022.csv'
#key_list = 'unarchive_key_list_2023.csv'

# Set up AWS credentials
aws_access_key_id = os.environ['AWS_ACCESS_KEY_ID']
aws_secret_access_key = os.environ['AWS_SECRET_ACCESS_KEY']
region_name = os.environ['REGION_NAME']

# Set up S3 client
s3 = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key, region_name=region_name)

# Set up the source and destination S3 bucket names
source_bucket_name = os.environ['SOURCE_BUCKET_NAME']
destination_bucket_name = os.environ['DESTINATION_BUCKET_NAME']

# Define the name of the directory where the downloaded files will be stored
input_directory = os.environ['INPUT_DIRECTORY']

# Start from an empty input directory: remove any previous copy, then recreate it.
if os.path.exists(input_directory):
    shutil.rmtree(input_directory)
    print('Old input directory successfully removed.')
try:
    os.makedirs(input_directory)
    print('Input directory successfully created.')
except OSError as e:
    print('Failed to create the directory:', e)
    sys.exit(1)

# Define the name of the directory where the unarchived files will be stored
output_directory = os.environ['OUTPUT_DIRECTORY']

# Start from an empty output directory: remove any previous copy, then recreate it.
if os.path.exists(output_directory):
    shutil.rmtree(output_directory)
    print('Old output directory successfully removed.')
try:
    os.makedirs(output_directory)
    print('Output directory successfully created.')
except OSError as e:
    print('Failed to create the directory:', e)
    sys.exit(1)

# Read the key CSV file into a pandas DataFrame
df = pd.read_csv(key_list, header=0)

# Iterate over each row in the DataFrame
for index, row in df.iterrows():
    tar_file_key = row['filename-destination']
    timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print(timestamp, ': Downloading ' + tar_file_key)
    s3.download_file(source_bucket_name, tar_file_key, input_directory + '/' + tar_file_key)
    timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print(timestamp, ': Extracting ' + tar_file_key)
    with tarfile.open(input_directory + '/' + tar_file_key, 'r') as tar:
        tar.extractall(path=output_directory)
    # Iterate over the extracted files
    timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print(timestamp, ': Extracting and transferring .bz2 files to S3 for ' + tar_file_key)
    for root, dirs, files in os.walk(output_directory):
        for file in files:
            if file.endswith('.bz2'):
                # Uncompress each .bz2 file
                with bz2.open(os.path.join(root, file), 'rb') as bz_file:
                    uncompressed_data = bz_file.read()
                
                    # Get the path of the file relative to the extraction directory
                    relative_path = os.path.relpath(os.path.join(root, file), output_directory)
                
                    # Upload the processed file to the destination S3 bucket with the same directory tree structure
                    # (relative_path already ends with the file name; S3 keys use forward slashes)
                    destination_key = relative_path.replace(os.sep, '/')
                    s3.put_object(Body=uncompressed_data, Bucket=destination_bucket_name, Key=destination_key)
                    timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                    print(timestamp, ':', file, 'transferred')
    shutil.rmtree(input_directory)
    os.makedirs(input_directory)
    shutil.rmtree(output_directory)
    os.makedirs(output_directory)
    timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print(timestamp, ': Input and output directories cleared out for ' + tar_file_key)
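
The loop above reads each decompressed file fully into memory before uploading it. A minimal sketch of a streaming alternative, assuming boto3's `upload_fileobj` client method, which accepts a binary file-like object; the helper names `build_destination_key` and `upload_decompressed` are hypothetical, and stripping the `.bz2` suffix from the key is an assumption rather than the script's behaviour:

```python
import bz2
import os


def build_destination_key(extract_root, file_path):
    """Return an S3 key mirroring file_path's location under extract_root."""
    relative_path = os.path.relpath(file_path, extract_root)
    # S3 keys use forward slashes regardless of the local path separator.
    key = relative_path.replace(os.sep, '/')
    # Assumption: drop '.bz2' because the uploaded content is uncompressed.
    if key.endswith('.bz2'):
        key = key[:-len('.bz2')]
    return key


def upload_decompressed(s3_client, bucket, extract_root, file_path):
    """Upload the decompressed content of one .bz2 file as a stream."""
    key = build_destination_key(extract_root, file_path)
    with bz2.open(file_path, 'rb') as bz_file:
        # The client pulls data from bz_file in chunks via its read() method.
        s3_client.upload_fileobj(bz_file, bucket, key)
    return key
```

Inside the `os.walk` loop, the call would look like `upload_decompressed(s3, destination_bucket_name, output_directory, os.path.join(root, file))`.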


## Sample code to parse the tweet files ([JSONL](https://spark.apache.org/docs/latest/sql-data-sources-json.html) format) into a DataFrame on Amazon EMR

In [None]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("Tweet Analysis") \
    .getOrCreate()

# Read the JSONL files from S3 and create a dataframe
# (Spark reads one JSON object per line by default, so 'multiLine' must stay off for JSONL)
tweets_df = spark.read.json("s3://your-bucket/tweets.jsonl")

# Filter the dataframe to extract the desired data
filtered_df = tweets_df.filter(tweets_df['text'].contains("your-filter-keyword"))

# Show the filtered dataframe
filtered_df.show()
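
For reference, JSONL stores one complete JSON object per line, which is the layout Spark's JSON reader expects by default. A minimal pure-Python sketch of that layout and of the keyword filter, using made-up sample records; the helper name `parse_jsonl` is hypothetical:

```python
import json


def parse_jsonl(text):
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.split('\n') if line.strip()]


# Made-up sample records, one per line, for illustration only.
sample = '{"id": 1, "text": "first tweet"}\n{"id": 2, "text": "second tweet"}\n'
records = parse_jsonl(sample)

# Same idea as the Spark filter: keep records whose text contains a keyword.
matching = [record for record in records if 'second' in record['text']]
```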


In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("The Twitter Grab 2019 Corpus").getOrCreate()

# Set the S3 bucket and folder paths
source_bucket = 'gelctweets'
year = '2019'
month = '01'
data_source = 's3://' + source_bucket + '/' + year + '_' + month + '/**/*.json.bz2'

# Read the JSONL files into a DataFrame
df = spark.read.json(data_source)

# Show the first few rows of the DataFrame
df.show()
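
To read more than one month, the same path pattern can be generated per month. A minimal sketch, assuming the `<year>_<month>` prefix layout used above holds for every month; the helper name `monthly_sources` is hypothetical:

```python
def monthly_sources(bucket, year, months):
    """Build one S3 glob per month, following the '<year>_<month>' layout above."""
    return [
        f's3://{bucket}/{year}_{month:02d}/**/*.json.bz2'
        for month in months
    ]


# First quarter of 2019, zero-padded to match the '01' style used above.
sources = monthly_sources('gelctweets', 2019, range(1, 4))
```

`spark.read.json` also accepts a list of paths, so `spark.read.json(sources)` would load all three months into one DataFrame.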


# snscrape format

In [14]:
import pandas as pd

# Read the JSON file into a DataFrame
df1 = pd.read_json('snscrape.json', lines=True)

In [16]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   _type             200 non-null    object             
 1   url               200 non-null    object             
 2   date              200 non-null    datetime64[ns, UTC]
 3   rawContent        200 non-null    object             
 4   renderedContent   200 non-null    object             
 5   id                200 non-null    int64              
 6   user              200 non-null    object             
 7   replyCount        200 non-null    int64              
 8   retweetCount      200 non-null    int64              
 9   likeCount         200 non-null    int64              
 10  quoteCount        200 non-null    int64              
 11  conversationId    200 non-null    int64              
 12  lang              200 non-null    object             
 13  sourc

In [18]:
# Each 'quotedTweet' cell is already a dict, so it can be indexed directly
quoted_tweet = df1.loc[7, 'quotedTweet']

media = quoted_tweet['media']
print(media)

[{'_type': 'snscrape.modules.twitter.Photo', 'previewUrl': 'https://pbs.twimg.com/media/DyLY0iQW0AAk5XK?format=jpg&name=small', 'fullUrl': 'https://pbs.twimg.com/media/DyLY0iQW0AAk5XK?format=jpg&name=large'}]
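
Most rows have no quoted tweet, and a quoted tweet may carry no media, so the direct indexing above can fail on other rows. A minimal sketch of a defensive helper, assuming the dictionary layout printed above; the name `media_urls` is hypothetical:

```python
def media_urls(quoted_tweet):
    """Return the 'fullUrl' of each media item, or [] when there is none."""
    # Rows without a quoted tweet hold None or NaN rather than a dict.
    if not isinstance(quoted_tweet, dict):
        return []
    # 'media' may be missing or None when the quoted tweet has no attachments.
    media = quoted_tweet.get('media') or []
    return [item['fullUrl'] for item in media if 'fullUrl' in item]
```

Applied to the whole column, this would be `df1['quotedTweet'].apply(media_urls)`.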


In [20]:
df1

Unnamed: 0,_type,url,date,rawContent,renderedContent,id,user,replyCount,retweetCount,likeCount,...,retweetedTweet,quotedTweet,inReplyToTweetId,inReplyToUser,mentionedUsers,coordinates,place,hashtags,cashtags,card
0,snscrape.modules.twitter.Tweet,https://twitter.com/AmarilsaPi/status/10907603...,2019-01-30 23:55:17+00:00,@estudioi Não estou preocupada com o custo fin...,@estudioi Não estou preocupada com o custo fin...,1090760391850934281,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,0,0,...,,,1.090622e+18,"{'_type': 'snscrape.modules.twitter.User', 'us...","[{'_type': 'snscrape.modules.twitter.User', 'u...",,,,,
1,snscrape.modules.twitter.Tweet,https://twitter.com/FranciscoSenaDF/status/109...,2019-01-30 23:54:12+00:00,@Antonbaptista @folha Venezuela é ditadura gen...,@Antonbaptista @folha Venezuela é ditadura gen...,1090760120911491073,"{'_type': 'snscrape.modules.twitter.User', 'us...",3,0,1,...,,,1.090757e+18,"{'_type': 'snscrape.modules.twitter.User', 'us...","[{'_type': 'snscrape.modules.twitter.User', 'u...",,,,,
2,snscrape.modules.twitter.Tweet,https://twitter.com/eduardolm171/status/109075...,2019-01-30 23:51:42+00:00,Bolsonaro se encontrou com presidente genocida...,Bolsonaro se encontrou com presidente genocida...,1090759491627503616,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,0,0,...,,,,,"[{'_type': 'snscrape.modules.twitter.User', 'u...",,,,,
3,snscrape.modules.twitter.Tweet,https://twitter.com/VPFac/status/1090756459099...,2019-01-30 23:39:39+00:00,"@PolitzOficial È uma questão humanitária, sim,...","@PolitzOficial È uma questão humanitária, sim,...",1090756459099537408,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,0,0,...,,,1.090375e+18,"{'_type': 'snscrape.modules.twitter.User', 'us...","[{'_type': 'snscrape.modules.twitter.User', 'u...",,,[chorapetralha],,
4,snscrape.modules.twitter.Tweet,https://twitter.com/DrReynaldo/status/10907552...,2019-01-30 23:34:44+00:00,@monicabergamo A Sra pensa q é correto esse ge...,@monicabergamo A Sra pensa q é correto esse ge...,1090755218566975488,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,0,2,...,,,1.090623e+18,"{'_type': 'snscrape.modules.twitter.User', 'us...","[{'_type': 'snscrape.modules.twitter.User', 'u...",,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,snscrape.modules.twitter.Tweet,https://twitter.com/teodeiocomamor/status/1090...,2019-01-29 21:59:00+00:00,"@gleisi Porque é de um corrupto, assassino, g...","@gleisi Porque é de um corrupto, assassino, g...",1090368739349094405,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,0,1,...,,,1.090367e+18,"{'_type': 'snscrape.modules.twitter.User', 'us...","[{'_type': 'snscrape.modules.twitter.User', 'u...",,,,,
196,snscrape.modules.twitter.Tweet,https://twitter.com/Melo_Guilherme/status/1090...,2019-01-29 21:58:48+00:00,"@MBLivre Me lembra um genocida, chamado Mandel...","@MBLivre Me lembra um genocida, chamado Mandel...",1090368689793351681,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,0,0,...,,,1.090262e+18,"{'_type': 'snscrape.modules.twitter.User', 'us...","[{'_type': 'snscrape.modules.twitter.User', 'u...",,,,,
197,snscrape.modules.twitter.Tweet,https://twitter.com/andersonino/status/1090364...,2019-01-29 21:42:40+00:00,"@gwannael @folha O que é o que é, tem cara de ...","@gwannael @folha O que é o que é, tem cara de ...",1090364630176878593,"{'_type': 'snscrape.modules.twitter.User', 'us...",1,0,0,...,,,1.090351e+18,"{'_type': 'snscrape.modules.twitter.User', 'us...","[{'_type': 'snscrape.modules.twitter.User', 'u...",,,,,
198,snscrape.modules.twitter.Tweet,https://twitter.com/renata_passari/status/1090...,2019-01-29 21:08:40+00:00,@GloboNews @MarceloLins68 Vão dizer que foi go...,@GloboNews @MarceloLins68 Vão dizer que foi go...,1090356075071766530,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,0,0,...,,,1.090349e+18,"{'_type': 'snscrape.modules.twitter.User', 'us...","[{'_type': 'snscrape.modules.twitter.User', 'u...",,,,,


# Internet Archive format

In [22]:
import pandas as pd

# Read the JSON file into a DataFrame
df2 = pd.read_json('intarch.json', lines=True)



In [24]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 37 columns):
 #   Column                     Non-Null Count  Dtype              
---  ------                     --------------  -----              
 0   created_at                 176 non-null    datetime64[ns, UTC]
 1   id                         176 non-null    float64            
 2   id_str                     176 non-null    float64            
 3   text                       176 non-null    object             
 4   display_text_range         45 non-null     object             
 5   source                     176 non-null    object             
 6   truncated                  176 non-null    float64            
 7   in_reply_to_status_id      38 non-null     float64            
 8   in_reply_to_status_id_str  38 non-null     float64            
 9   in_reply_to_user_id        39 non-null     float64            
 10  in_reply_to_user_id_str    39 non-null     float64            
 11  in_rep

In [26]:
df2

Unnamed: 0,created_at,id,id_str,text,display_text_range,source,truncated,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,...,timestamp_ms,retweeted_status,extended_entities,possibly_sensitive,quoted_status_id,quoted_status_id_str,quoted_status,quoted_status_permalink,extended_tweet,delete
0,2019-01-01 07:29:00+00:00,1.080003e+18,1.080003e+18,@supr_dorapo 最新のゼルダやってないけどゼルダも好きですﾋﾟｭｰﾋﾟｭｰﾋﾟｭﾋ...,"[13, 58]","<a href=""http://twitter.com/download/android"" ...",0.0,1.080003e+18,1.080003e+18,2.787406e+09,...,2019-01-01 07:28:07.168,,,,,,,,,
1,2019-01-01 07:29:00+00:00,1.080003e+18,1.080003e+18,そうだよ、僕がピーマンだよ,,"<a href=""http://twittbot.net/"" rel=""nofollow"">...",0.0,,,,...,2019-01-01 07:28:07.168,,,,,,,,,
2,2019-01-01 07:29:00+00:00,1.080003e+18,1.080003e+18,@subaru_yamamoto 勝ち退くが利って、それ得意なひとTwitterにたくさんい...,"[17, 66]","<a href=""http://twitter.com/download/iphone"" r...",0.0,1.080002e+18,1.080002e+18,3.955496e+09,...,2019-01-01 07:28:07.168,,,,,,,,,
3,2019-01-01 07:29:00+00:00,1.080003e+18,1.080003e+18,RT @PPGcnjp: 初もうで😊 https://t.co/NOfd4q1ro5,,"<a href=""http://twitter.com/download/iphone"" r...",0.0,,,,...,2019-01-01 07:28:07.168,{'created_at': 'Tue Jan 01 07:24:00 +0000 2019...,"{'media': [{'id': 1076011587624873984, 'id_str...",1.0,,,,,,
4,2019-01-01 07:29:00+00:00,1.080003e+18,1.080003e+18,RT @DJ_GINTA_NEW: 【お正月の歌作ったよん🎍】 https://t.co/e...,,"<a href=""http://twitter.com/download/iphone"" r...",0.0,,,,...,2019-01-01 07:28:07.168,{'created_at': 'Tue Jan 01 07:22:06 +0000 2019...,"{'media': [{'id': 1080001090727247872, 'id_str...",0.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,2019-01-01 07:29:04+00:00,1.080003e+18,1.080003e+18,@Fams000 @Rouguii77 +mention inch'allah bonne ...,"[20, 59]","<a href=""http://twitter.com/download/android"" ...",0.0,1.079895e+18,1.079895e+18,2.878622e+09,...,2019-01-01 07:28:07.168,,,,,,,,,
196,2019-01-01 07:29:04+00:00,1.080003e+18,1.080003e+18,RT @Sphenodontiaart: hope to god they kiss for...,,"<a href=""http://twitter.com/download/iphone"" r...",0.0,,,,...,2019-01-01 07:28:07.168,{'created_at': 'Mon Dec 31 06:09:07 +0000 2018...,"{'media': [{'id': 1079620371505065984, 'id_str...",0.0,,,,,,
197,2019-01-01 07:29:04+00:00,1.080003e+18,1.080003e+18,RT @fuzakefactory: 明けましておめでとうございます🐗\n新年お年玉企画🎉\...,,"<a href=""http://twitter.com/download/iphone"" r...",0.0,,,,...,2019-01-01 07:28:07.168,{'created_at': 'Tue Jan 01 06:45:32 +0000 2019...,,,,,,,,
198,2019-01-01 07:29:04+00:00,1.080003e+18,1.080003e+18,Feliz año nuevo! Happy new year! #2019 @ Nuevo...,,"<a href=""http://instagram.com"" rel=""nofollow"">...",0.0,,,,...,2019-01-01 07:28:07.168,,,0.0,,,,,,
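
The `info()` output above shows `id` and `id_str` parsed as `float64`, which cannot represent 19-digit tweet IDs exactly. A minimal sketch of a line-by-line loader that keeps IDs as strings, assuming that the rows without `created_at` are deletion notices carrying a `delete` key (an assumption suggested by the `delete` column, not confirmed by the source); the helper name `load_tweets` and the sample records are hypothetical:

```python
import json


def load_tweets(lines):
    """Parse JSONL lines, skipping deletion notices and keeping IDs as strings."""
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: deletion notices have a 'delete' key and no tweet fields.
        if 'delete' in record:
            continue
        # Python parses JSON integers exactly, so the string keeps every digit.
        if 'id' in record:
            record['id_str'] = str(record['id'])
        tweets.append(record)
    return tweets


# Made-up sample lines for illustration only.
sample_lines = [
    '{"id": 1080003000000000123, "text": "hello"}',
    '{"delete": {"status": {"id": 1080002000000000456}}}',
]
tweets = load_tweets(sample_lines)
```

The resulting list can be passed to `pd.DataFrame(tweets)`, where `id_str` stays an `object` column of exact strings.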
