<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Mariana - Dataset preparation

## Tweet Object documentation

Please refer to the [Tweet Object](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet) documentation.

## Importing the required libraries

In [1]:
import re
import pandas as pd

## Setting input and output filenames

Set `input_filename` to the name of the file to be processed. The output filename is derived from it by inserting `suffix` before the file extension.

In [2]:
input_filename = 'mari2016_1.jsonl'
suffix = '_pt'

def add_suffix(filename):
    # Split the filename into its base name and its extension
    # (the hyphen is placed last in the character class so it is read literally)
    match = re.match(r'^([A-Za-z0-9_,\s-]+)(\.[A-Za-z]{1,5})$', filename)
    if match is None:
        raise ValueError(f'Unexpected filename format: {filename!r}')
    base_filename, extension = match.groups()

    # Insert the suffix between the base name and the original extension
    return f'{base_filename}{suffix}{extension}'

output_filename = add_suffix(input_filename)
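
As an alternative sketch, the same renaming can be done with the standard library's `pathlib`, which also copes with filenames that contain extra dots or directory components. The helper name `add_suffix_pathlib` and its explicit `suffix` parameter are assumptions made for illustration; they are not part of the notebook.

```python
from pathlib import Path

def add_suffix_pathlib(filename, suffix):
    # Hypothetical helper: insert `suffix` between the stem and the extension
    path = Path(filename)
    return str(path.with_name(f'{path.stem}{suffix}{path.suffix}'))

print(add_suffix_pathlib('mari2016_1.jsonl', '_pt'))  # mari2016_1_pt.jsonl
```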

## Data wrangling

### Consolidating multiple JSONL files into one JSONL file

Note: To view this procedure, download this Jupyter Notebook and open it in JupyterLab (provided as part of the Anaconda Distribution).
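
Since the procedure itself is not visible in this export, the following is only a minimal sketch of one way to consolidate JSONL files, assuming each input file holds one JSON object per line. The function name `consolidate_jsonl` and any filenames passed to it are hypothetical and may differ from what the notebook actually does.

```python
def consolidate_jsonl(input_paths, output_path):
    """Concatenate several JSONL files into one, skipping blank lines.

    Returns the number of records written.
    """
    written = 0
    with open(output_path, 'w', encoding='utf-8') as out_file:
        for path in input_paths:
            with open(path, encoding='utf-8') as in_file:
                for line in in_file:
                    line = line.strip()
                    if line:
                        # Re-add the newline so every record ends its own line,
                        # even when an input file lacks a trailing newline
                        out_file.write(line + '\n')
                        written += 1
    return written
```

It could be called with a list of part files and the name later assigned to `input_filename`, for example `consolidate_jsonl(['part_1.jsonl', 'part_2.jsonl'], 'mari2016_1.jsonl')`, where the part filenames are placeholders.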

### Importing the tweet raw data into a dataframe

In [3]:
df_tweets_raw_data = pd.read_json(input_filename, lines=True)

In [4]:
df_tweets_raw_data.head(5)

| | created_at | entities | favorite_count | favorited | filter_level | id | id_str | is_quote_status | lang | possibly_sensitive | ... | in_reply_to_user_id | in_reply_to_user_id_str | extended_entities | place | quoted_status | quoted_status_id | quoted_status_id_str | coordinates | geo | scopes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-15 14:24:21+00:00 | {'hashtags': [], 'symbols': [], 'urls': [{'dis... | 0 | False | low | 688003794085351425 | 688003794085351424 | True | es | 0.0 | ... | | | | | | | | | | |
| 1 | 2016-01-15 14:24:23+00:00 | {'hashtags': [], 'symbols': [], 'urls': [], 'u... | 0 | False | low | 688003802478297088 | 688003802478297088 | False | pt | | ... | 812509.0 | 812509.0 | | | | | | | | |
| 2 | 2016-01-15 14:24:52+00:00 | {'hashtags': [{'indices': [62, 82], 'text': 'P... | 0 | False | low | 688003924134068224 | 688003924134068224 | False | es | 0.0 | ... | | | {'media': [{'display_url': 'pic.twitter.com/c0... | | | | | | | |
| 3 | 2016-01-15 14:25:19+00:00 | {'hashtags': [], 'symbols': [], 'urls': [], 'u... | 0 | False | low | 688004037355155456 | 688004037355155456 | False | pt | | ... | | | | | | | | | | |
| 4 | 2016-01-15 14:25:25+00:00 | {'hashtags': [], 'symbols': [], 'urls': [{'dis... | 0 | False | low | 688004062537736192 | 688004062537736192 | False | es | 0.0 | ... | | | | | | | | | | |


In [5]:
df_tweets_raw_data.shape

(235483, 31)

### Checking if data types are consistent

In [6]:
df_tweets_raw_data.dtypes

created_at                   datetime64[ns, UTC]
entities                                  object
favorite_count                             int64
favorited                                   bool
filter_level                              object
id                                         int64
id_str                                     int64
is_quote_status                             bool
lang                                      object
possibly_sensitive                       float64
retweet_count                              int64
retweeted                                   bool
retweeted_status                          object
source                                    object
text                                      object
timestamp_ms                      datetime64[ns]
truncated                                   bool
user                                      object
in_reply_to_screen_name                   object
in_reply_to_status_id                    float64
in_reply_to_status_i

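#### Converting the `id` column's data type to `str` for future use

Note: pandas imported the `id_str` attribute incorrectly in some cases. In the first row of the preview above, `id_str` ends in `424` while `id` ends in `425`, and the `dtypes` output shows `id_str` as `int64` rather than as a string. The cause was not investigated here; a likely explanation is that the value passed through a 64-bit float during type inference. Therefore, `id` is used instead.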

In [7]:
df_tweets_raw_data['id'] = df_tweets_raw_data['id'].astype(str)
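
One plausible explanation for the `id_str` mismatch, which the notebook does not confirm, is a round trip through a 64-bit float. The sketch below uses only plain Python and the `id` value of the first row in the preview; it shows that such a round trip yields exactly the `id_str` value displayed there. Whether pandas takes this path internally is an assumption.

```python
tweet_id = 688003794085351425  # 'id' of the first row in the preview

# A 64-bit float has a 53-bit significand, so integers above 2**53
# cannot all be represented exactly
round_tripped = int(float(tweet_id))

print(tweet_id > 2**53)           # True
print(round_tripped)              # 688003794085351424
print(round_tripped == tweet_id)  # False
```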

### Dropping unnecessary columns

#### Listing the columns

In [8]:
df_tweets_raw_data.columns.values.tolist()

['created_at',
 'entities',
 'favorite_count',
 'favorited',
 'filter_level',
 'id',
 'id_str',
 'is_quote_status',
 'lang',
 'possibly_sensitive',
 'retweet_count',
 'retweeted',
 'retweeted_status',
 'source',
 'text',
 'timestamp_ms',
 'truncated',
 'user',
 'in_reply_to_screen_name',
 'in_reply_to_status_id',
 'in_reply_to_status_id_str',
 'in_reply_to_user_id',
 'in_reply_to_user_id_str',
 'extended_entities',
 'place',
 'quoted_status',
 'quoted_status_id',
 'quoted_status_id_str',
 'coordinates',
 'geo',
 'scopes']

#### Selecting the columns that are being dropped

1. Edit the previous list and comment out `'created_at'`, `'id'`, `'lang'`, `'text'` and `'user'` (the columns to keep);
2. Update the following command with the edited list before running it.

In [9]:
df_tweets_raw_data = df_tweets_raw_data.drop(columns=[
#    'created_at',
    'entities',
    'favorite_count',
    'favorited',
    'filter_level',
#    'id',
    'id_str',
    'is_quote_status',
#    'lang',
    'possibly_sensitive',
    'retweet_count',
    'retweeted',
    'retweeted_status',
    'source',
#    'text',
    'timestamp_ms',
    'truncated',
#    'user',
    'in_reply_to_screen_name',
    'in_reply_to_status_id',
    'in_reply_to_status_id_str',
    'in_reply_to_user_id',
    'in_reply_to_user_id_str',
    'extended_entities',
    'place',
    'quoted_status',
    'quoted_status_id',
    'quoted_status_id_str',
    'coordinates',
    'geo',
    'scopes'
])

In [10]:
df_tweets_raw_data.columns.values.tolist()

['created_at', 'id', 'lang', 'text', 'user']
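
When only a handful of columns are kept, selecting them directly can be shorter than listing every column to drop. The following is a sketch on a toy dataframe with placeholder values; `columns_to_keep` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame with a few of the columns listed above; the values are placeholders
df = pd.DataFrame({
    'created_at': ['2016-01-15'],
    'id': ['1'],
    'lang': ['pt'],
    'text': ['...'],
    'user': [{}],
    'geo': [None],
    'scopes': [None],
})

# Keep only the columns needed downstream; everything else is discarded
columns_to_keep = ['created_at', 'id', 'lang', 'text', 'user']
df = df[columns_to_keep]

print(df.columns.tolist())  # ['created_at', 'id', 'lang', 'text', 'user']
```

The drop-based approach used in the notebook has the advantage of making every discarded column explicit.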

### Listing the values of the `lang` attribute

In [11]:
df_tweets_raw_data['lang'].unique()

array(['es', 'pt', 'en', 'tl', 'fr', 'und', 'it', 'in', 'nl', 'sv', 'ru',
       'ja', 'de', 'ro', 'fi', 'no', 'hi', 'ht', 'lt', 'et', 'eu', 'pl',
       'da', 'cy', 'cs', 'tr', 'ko', 'lv', 'is', 'hu', 'uk', 'sl', 'vi',
       'ar', 'ur', 'el', 'th'], dtype=object)
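
Before filtering, it can help to see how many tweets each language code accounts for. This is a small sketch on made-up codes standing in for `df_tweets_raw_data['lang']`; the counts are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Made-up language codes standing in for df_tweets_raw_data['lang']
lang = pd.Series(['es', 'pt', 'pt', 'en', 'pt', 'und'])

# Number of tweets per language code, most frequent first
lang_counts = lang.value_counts()
print(lang_counts)
```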

### Keeping only the tweets in Portuguese

In [12]:
df_tweets_raw_data = df_tweets_raw_data[df_tweets_raw_data['lang'] == 'pt'].reset_index(drop=True)

### Extracting the column `username`

In [13]:
# Flatten the nested JSON 'user' attribute
df_tweets_raw_data_flattened_user = pd.json_normalize(df_tweets_raw_data['user'])

# Extract the 'screen_name' attribute
username = df_tweets_raw_data_flattened_user['screen_name']

# Create a new column 'username'
df_tweets_raw_data['username'] = username
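
Assigning the flattened column back relies on both dataframes sharing the same index, which is why the `reset_index(drop=True)` in the filtering step matters. The sketch below illustrates this on toy records; the usernames and IDs are made up, and passing a plain list to `pd.json_normalize` is a choice made here to get a fresh 0-based index.

```python
import pandas as pd

# Toy records standing in for tweets; the values are made up for illustration
df = pd.DataFrame({
    'lang': ['es', 'pt', 'pt'],
    'user': [
        {'id_str': '1', 'screen_name': 'alice'},
        {'id_str': '2', 'screen_name': 'bob'},
        {'id_str': '3', 'screen_name': 'carol'},
    ],
})

# Filtering keeps the original index labels (1 and 2)
filtered = df[df['lang'] == 'pt']

# Flattening a list of dicts yields a fresh index starting at 0
flattened = pd.json_normalize(filtered['user'].tolist())

# Without reset_index, assignment aligns on labels: names shift and one row gets NaN
misaligned = filtered.copy()
misaligned['username'] = flattened['screen_name']

# With reset_index(drop=True), both sides share the index 0..n-1
aligned = filtered.reset_index(drop=True)
aligned['username'] = flattened['screen_name']

print(misaligned['username'].tolist())
print(aligned['username'].tolist())  # ['bob', 'carol']
```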

### Extracting the column `author_id`

In [14]:
# Extract the 'id_str' attribute
author_id = df_tweets_raw_data_flattened_user['id_str']

# Create a new column 'author_id'
df_tweets_raw_data['author_id'] = author_id

### Extracting the column `tweet_url`

In [15]:
# Construct the tweet URL using the tweet ID and user's screen name
df_tweets_raw_data['tweet_url'] = (
    'https://twitter.com/' + 
    df_tweets_raw_data['username'] + 
    '/status/' + 
    df_tweets_raw_data['id']
)

In [16]:
df_tweets_raw_data

| | created_at | id | lang | text | user | username | author_id | tweet_url |
|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-15 14:24:23+00:00 | 688003802478297088 | pt | @mitcha é um problema na córnea. Deixa ela em ... | {'contributors_enabled': False, 'created_at': ... | mosana | 26246548 | https://twitter.com/mosana/status/688003802478... |
| 1 | 2016-01-15 14:25:19+00:00 | 688004037355155456 | pt | nunca desconte seus problemas em alguém isso s... | {'contributors_enabled': False, 'created_at': ... | jdbfuckzz | 550458673 | https://twitter.com/jdbfuckzz/status/688004037... |
| 2 | 2016-01-15 14:25:40+00:00 | 688004125452288000 | pt | RT @relatojovens: Teremos problemas, mas nunca... | {'contributors_enabled': False, 'created_at': ... | WorldOfJovens | 474149903 | https://twitter.com/WorldOfJovens/status/68800... |
| 3 | 2016-01-15 14:26:09+00:00 | 688004247070334976 | pt | Vocês já pararam pra pensar que muitas vezes a... | {'contributors_enabled': False, 'created_at': ... | thisismarcela_ | 785396354 | https://twitter.com/thisismarcela_/status/6880... |
| 4 | 2016-01-15 14:26:46+00:00 | 688004402259607554 | pt | RT @proundmatt: o problema da dor e que ela pr... | {'contributors_enabled': False, 'created_at': ... | DaCat01 | 1713342241 | https://twitter.com/DaCat01/status/68800440225... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50170 | 2016-01-15 20:33:09+00:00 | 688096605665275905 | pt | Rússia e EUA conversam sobre Ucrânia na fronte... | {'contributors_enabled': False, 'created_at': ... | AtualNoticia | 513533519 | https://twitter.com/AtualNoticia/status/688096... |
| 50171 | 2016-01-19 13:14:07+00:00 | 689435670750765057 | pt | Eu sou do tipo q adora o perigo, q metade eh j... | {'contributors_enabled': False, 'created_at': ... | ingridchris123 | 3568450575 | https://twitter.com/ingridchris123/status/6894... |
| 50172 | 2016-01-28 12:45:01+00:00 | 692689838286442496 | pt | RT @g1: Blocos tradicionais do Rio cancelam de... | {'contributors_enabled': False, 'created_at': ... | BiaRosenburg | 41925915 | https://twitter.com/BiaRosenburg/status/692689... |
| 50173 | 2016-01-19 13:08:36+00:00 | 689434282444521472 | pt | Twitter deu problema ? É isso minha gente..? | {'contributors_enabled': False, 'created_at': ... | ooi_pri | 269960285 | https://twitter.com/ooi_pri/status/68943428244... |


### Inspecting the data

In [17]:
inspected_row = 160
print('username:' + df_tweets_raw_data.loc[inspected_row, 'username'])
print('text:' + df_tweets_raw_data.loc[inspected_row, 'text'])
print('tweet_url:' + df_tweets_raw_data.loc[inspected_row, 'tweet_url'])

username:rodrigomanzatto
text:Só me arruma problema ele kkkkkkk
tweet_url:https://twitter.com/rodrigomanzatto/status/687648075184226309


### Creating the output file

In [18]:
df_tweets_raw_data[['created_at', 'author_id', 'username', 'tweet_url', 'text']].to_json(output_filename, orient='records', lines=True)

### Creating the `.tsv` output file (deprecated)

In [None]:
#df_tweets_raw_data[['created_at', 'author_id', 'username', 'tweet_url', 'text']].to_csv('mari20192020.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')