<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Mariana - Dataset preparation

## Tweet Object documentation

Please refer to the [Tweet Object](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet) documentation.

## Importing the required libraries

In [1]:
import pandas as pd

## Data wrangling

### Importing the tweet raw data into a dataframe

In [2]:
df_tweets_raw_data = pd.read_json('mari201903.jsonl', lines=True)

In [3]:
df_tweets_raw_data.head(5)

| | created_at | entities | favorite_count | favorited | filter_level | id | id_str | is_quote_status | lang | quote_count | ... | quoted_status_permalink | display_text_range | in_reply_to_screen_name | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | coordinates | geo | place |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-31 15:05:50+00:00 | {'hashtags': [{'indices': [17, 27], 'text': 'V... | 0 | False | low | 1112370424334172161 | 1112370424334172160 | False | es | 0 | ... | | | | | | | | | | |
| 1 | 2019-03-17 10:56:42+00:00 | {'hashtags': [{'indices': [122, 128], 'text': ... | 0 | False | low | 1107234297780408321 | 1107234297780408320 | False | es | 0 | ... | | | | | | | | | | |
| 2 | 2019-03-31 15:06:07+00:00 | {'hashtags': [{'indices': [28, 38], 'text': 'V... | 0 | False | low | 1112370495645724673 | 1112370495645724672 | False | en | 0 | ... | | | | | | | | | | |
| 3 | 2019-03-31 15:06:07+00:00 | {'hashtags': [{'indices': [69, 79], 'text': 'V... | 0 | False | low | 1112370495624753154 | 1112370495624753152 | False | es | 0 | ... | | | | | | | | | | |
| 4 | 2019-03-31 15:02:07+00:00 | {'hashtags': [{'indices': [18, 28], 'text': 'V... | 0 | False | low | 1112369489021034496 | 1112369489021034496 | False | es | 0 | ... | | | | | | | | | | |


In [4]:
df_tweets_raw_data.shape

(32982, 35)
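
Tweet IDs are larger than 2^53, so they are safest handled as text from the moment the file is read. The following is a minimal sketch, assuming a one-record in-memory sample in place of `mari201903.jsonl` (the sample record and the names `sample_jsonl` and `df_sample` are illustrative, not taken from the dataset). Passing `dtype={'id_str': str}` to `pd.read_json` asks pandas to keep that attribute as a string instead of inferring a numeric type for it.

```python
import io

import pandas as pd

# Hypothetical one-record sample standing in for the real JSONL file
sample_jsonl = (
    '{"id": 1112370424334172161, "id_str": "1112370424334172161", "lang": "es"}\n'
)

# Force 'id_str' to be read as text so no numeric inference is applied to it
df_sample = pd.read_json(
    io.StringIO(sample_jsonl),
    lines=True,
    dtype={'id_str': str},
)

print(df_sample['id_str'].iloc[0])
```

Columns not named in the `dtype` mapping are still inferred as usual.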

### Checking if data types are consistent

In [5]:
df_tweets_raw_data.dtypes

created_at                   datetime64[ns, UTC]
entities                                  object
favorite_count                             int64
favorited                                   bool
filter_level                              object
id                                         int64
id_str                                     int64
is_quote_status                             bool
lang                                      object
quote_count                                int64
reply_count                                int64
retweet_count                              int64
retweeted                                   bool
retweeted_status                          object
source                                    object
text                                      object
timestamp_ms                      datetime64[ns]
truncated                                   bool
user                                      object
extended_entities                         object
possibly_sensitive  

#### Converting `id` column's data type to `str` for future use

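Note: pandas imported the attribute `id_str` incorrectly in some cases: its last digits differ from those of `id` (for example, `1112370424334172161` became `1112370424334172160`). This pattern is consistent with the values being rounded through a 64-bit float during type inference. Therefore, `id` is being used instead.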

In [6]:
df_tweets_raw_data['id'] = df_tweets_raw_data['id'].astype(str)
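
The last digits of `id_str` in the preview above differ from `id` in a way that matches rounding to a 64-bit float, which cannot represent every integer above 2^53. A short check in plain Python, using the first tweet ID from the preview (the variable names are illustrative), reproduces the same rounding:

```python
# First tweet ID from the preview above, as stored in the 'id' column
tweet_id = 1112370424334172161

# Round-trip the integer through a 64-bit float, as numeric type inference may do
rounded_id = int(float(tweet_id))

print(rounded_id)              # 1112370424334172160, the value shown in 'id_str'
print(rounded_id == tweet_id)  # False: the last digit was lost
print(tweet_id > 2**53)        # True: beyond the exact-integer range of float64
```

Keeping the ID as a string, as in the cell above, protects it from any later numeric conversion.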

### Dropping unnecessary columns

#### Listing the columns

In [7]:
df_tweets_raw_data.columns.values.tolist()

['created_at',
 'entities',
 'favorite_count',
 'favorited',
 'filter_level',
 'id',
 'id_str',
 'is_quote_status',
 'lang',
 'quote_count',
 'reply_count',
 'retweet_count',
 'retweeted',
 'retweeted_status',
 'source',
 'text',
 'timestamp_ms',
 'truncated',
 'user',
 'extended_entities',
 'possibly_sensitive',
 'extended_tweet',
 'quoted_status',
 'quoted_status_id',
 'quoted_status_id_str',
 'quoted_status_permalink',
 'display_text_range',
 'in_reply_to_screen_name',
 'in_reply_to_status_id',
 'in_reply_to_status_id_str',
 'in_reply_to_user_id',
 'in_reply_to_user_id_str',
 'coordinates',
 'geo',
 'place']

#### Selecting the columns that are being dropped

In [9]:
df_tweets_raw_data = df_tweets_raw_data.drop(columns=[
#    'created_at',
    'entities',
    'favorite_count',
    'favorited',
    'filter_level',
#    'id',
    'id_str',
    'is_quote_status',
#    'lang',
    'quote_count',
    'reply_count',
    'retweet_count',
    'retweeted',
    'retweeted_status',
    'source',
#    'text',
    'timestamp_ms',
    'truncated',
#    'user',
    'quoted_status',
    'quoted_status_id',
    'quoted_status_id_str',
    'quoted_status_permalink',
    'extended_tweet',
    'display_text_range',
    'extended_entities',
    'possibly_sensitive',
    'in_reply_to_screen_name',
    'in_reply_to_user_id',
    'in_reply_to_user_id_str',
    'in_reply_to_status_id',
    'in_reply_to_status_id_str',
    'coordinates',
    'geo',
    'place',
#    'withheld_in_countries'
])
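
Since only five columns survive this step, the same result can also be expressed as a selection of the columns to keep. The following is a minimal sketch on a hypothetical three-column frame (the data and the names `df_example` and `columns_to_keep` are illustrative assumptions). Plain label selection raises a `KeyError` if an expected column is missing, which makes schema changes visible early.

```python
import pandas as pd

# Hypothetical miniature frame standing in for df_tweets_raw_data
df_example = pd.DataFrame({
    'id': ['1', '2'],
    'lang': ['pt', 'es'],
    'favorite_count': [0, 3],
})

# Name the columns to keep instead of listing every column to drop
columns_to_keep = ['id', 'lang']
df_example = df_example[columns_to_keep]

print(df_example.columns.tolist())
```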

In [10]:
df_tweets_raw_data.columns.values.tolist()

['created_at', 'id', 'lang', 'text', 'user']

### Listing the values of the attribute `lang`

In [11]:
df_tweets_raw_data['lang'].unique()

array(['es', 'en', 'th', 'und', 'fi', 'lt', 'fr', 'ca', 'pt', 'de', 'in',
       'sv', 'it', 'tr', 'tl', 'zh', 'pl', 'ar', 'nl', 'ro', 'et', 'no',
       'da', 'eu', 'hu', 'ja', 'hi', 'ht', 'iw', 'ta', 'el'], dtype=object)

### Keeping only the tweets in Portuguese

In [12]:
df_tweets_raw_data = df_tweets_raw_data[df_tweets_raw_data['lang'] == 'pt'].reset_index(drop=True)
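
Before filtering, it can help to see how many tweets each language code has. The following is a minimal sketch on hypothetical data (the frame and the names `df_example` and `df_pt` are illustrative). `value_counts` tallies the codes, and the boolean mask followed by `reset_index(drop=True)` keeps the Portuguese rows and renumbers them from 0.

```python
import pandas as pd

# Hypothetical miniature frame standing in for df_tweets_raw_data
df_example = pd.DataFrame({
    'lang': ['es', 'pt', 'en', 'pt'],
    'text': ['hola', 'olá', 'hello', 'bom dia'],
})

# Tally the tweets per language code
print(df_example['lang'].value_counts())

# Keep only the Portuguese rows and renumber the index from 0
df_pt = df_example[df_example['lang'] == 'pt'].reset_index(drop=True)
print(df_pt)
```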

### Extracting the column `username`

In [13]:
# Flatten the nested JSON 'user' attribute
df_tweets_raw_data_flattened_user = pd.json_normalize(df_tweets_raw_data['user'])

# Extract the 'screen_name' attribute
username = df_tweets_raw_data_flattened_user['screen_name']

# Create a new column 'username'
df_tweets_raw_data['username'] = username

### Extracting the column `author_id`

In [14]:
# Extract the user's 'id_str' attribute
author_id = df_tweets_raw_data_flattened_user['id_str']

# Create a new column 'author_id'
df_tweets_raw_data['author_id'] = author_id
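
Assigning a Series to a DataFrame column matches rows by index label, so the assignments above rely on `df_tweets_raw_data` and the flattened frame sharing the same index. As an alternative sketch, assuming each `user` value is a dictionary as described in the Tweet Object documentation (the sample data and the name `df_example` are illustrative), the attributes can be read directly from each row's `user` value, which keeps the original index whatever it is:

```python
import pandas as pd

# Hypothetical frame with a non-default index and nested 'user' dictionaries
df_example = pd.DataFrame(
    {'user': [
        {'screen_name': 'alice', 'id_str': '111'},
        {'screen_name': 'bob', 'id_str': '222'},
    ]},
    index=[10, 20],
)

# Read one key from each dictionary; the result keeps the index labels 10 and 20
df_example['username'] = df_example['user'].map(lambda user: user['screen_name'])
df_example['author_id'] = df_example['user'].map(lambda user: user['id_str'])

print(df_example[['username', 'author_id']])
```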

### Extracting the column `tweet_url`

In [15]:
# Construct the tweet URL using the tweet ID and user's screen name
df_tweets_raw_data['tweet_url'] = (
    'https://twitter.com/' + 
    df_tweets_raw_data['username'] + 
    '/status/' + 
    df_tweets_raw_data['id']
)
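
The concatenation above works because `id` was converted to `str` earlier; adding an integer column to a string would raise a `TypeError`. The same URL pattern can be sketched as a plain Python function (`build_tweet_url` is a hypothetical helper, not part of the notebook), here applied to the tweet inspected later in this notebook:

```python
def build_tweet_url(username: str, tweet_id) -> str:
    """Return the status URL for a tweet; tweet_id may be an int or a str."""
    return f'https://twitter.com/{username}/status/{tweet_id}'


print(build_tweet_url('imeldareyna46', '1112255563323359233'))
# https://twitter.com/imeldareyna46/status/1112255563323359233
```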

In [16]:
df_tweets_raw_data

| | created_at | id | lang | text | user | username | author_id | tweet_url |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-02 15:06:35+00:00 | 1101861364975439873 | pt | RT @RenovaMidia: O presidente interino da #Ven... | {'contributors_enabled': False, 'created_at': ... | markmct1 | 2830603619 | https://twitter.com/markmct1/status/1101861364... |
| 1 | 2019-03-16 12:17:08+00:00 | 1106892151605088256 | pt | RT @RenovaMidia: #SemanaRENOVA\n\nEnquanto o p... | {'contributors_enabled': False, 'created_at': ... | MonRuivaru | 707710392835903488 | https://twitter.com/MonRuivaru/status/11068921... |
| 2 | 2019-03-21 14:39:13+00:00 | 1108739847395725314 | pt | RT @maibortpetit: A esta hora #OEA #Venezuela ... | {'contributors_enabled': False, 'created_at': ... | nidiarey1 | 758788578570801152 | https://twitter.com/nidiarey1/status/110873984... |
| 3 | 2019-03-02 14:44:32+00:00 | 1101855815932305408 | pt | RT @RenovaMidia: #SemanaRENOVA\n\nJuiz adjunto... | {'contributors_enabled': False, 'created_at': ... | jdsbobmarley | 189735759 | https://twitter.com/jdsbobmarley/status/110185... |
| 4 | 2019-03-16 16:24:31+00:00 | 1106954407680331776 | pt | RT @RenovaMidia: Além de esvaziar as riquezas ... | {'contributors_enabled': False, 'created_at': ... | soudanovaera | 951093579652517888 | https://twitter.com/soudanovaera/status/110695... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 157 | 2019-03-04 23:47:44+00:00 | 1102717292398415873 | pt | RT @zoemaria_5: CONFIRMADO\| O regime cubano es... | {'contributors_enabled': False, 'created_at': ... | MORNINGSTARTO | 1075465715799482371 | https://twitter.com/MORNINGSTARTO/status/11027... |
| 158 | 2019-03-05 05:24:39+00:00 | 1102802080245391360 | pt | Os verdadeiros invasores da #Venezuela @ernest... | {'contributors_enabled': False, 'created_at': ... | democraciaelib1 | 2828776396 | https://twitter.com/democraciaelib1/status/110... |
| 159 | 2019-03-29 07:37:30+00:00 | 1111532821816979456 | pt | RT @RenovaMidia: O presidente interino da #Ven... | {'contributors_enabled': False, 'created_at': ... | FlvioPinheiro14 | 1082786536050122753 | https://twitter.com/FlvioPinheiro14/status/111... |
| 160 | 2019-03-31 07:29:25+00:00 | 1112255563323359233 | pt | RT @HusseinBrasil: @jairbolsonaro @BolsonaroSP... | {'contributors_enabled': False, 'created_at': ... | imeldareyna46 | 163301786 | https://twitter.com/imeldareyna46/status/11122... |


### Inspecting the data

In [17]:
inspected_row = 160
print('username:' + df_tweets_raw_data.loc[inspected_row, 'username'])
print('text:' + df_tweets_raw_data.loc[inspected_row, 'text'])
print('tweet_url:' + df_tweets_raw_data.loc[inspected_row, 'tweet_url'])

username:imeldareyna46
text:RT @HusseinBrasil: @jairbolsonaro @BolsonaroSP @tarcisiogdf #Venezuela 

O popular Guaidó fora da redoma.

https://t.co/7RDhBD34MR
tweet_url:https://twitter.com/imeldareyna46/status/1112255563323359233


### Creating the file `mari20192020.tsv`

In [18]:
df_tweets_raw_data[['created_at', 'author_id', 'username', 'tweet_url', 'text']].to_csv('mari20192020.tsv', sep='\t', index=False, encoding='utf-8', lineterminator='\n')
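
Tweets can contain line breaks, so it is worth confirming that they survive the export. The following is a minimal sketch, assuming an in-memory buffer instead of `mari20192020.tsv` and a hypothetical one-row frame (the names `df_example`, `buffer` and `df_round_trip` are illustrative). pandas quotes a field that contains a newline, and reading `author_id` back with an explicit `str` dtype keeps it as text.

```python
import io

import pandas as pd

# Hypothetical one-row frame with a multi-line tweet text
df_example = pd.DataFrame({
    'author_id': ['163301786'],
    'text': ['first line\n\nsecond line'],
})

# Write to an in-memory buffer with the same separator and line terminator
buffer = io.StringIO()
df_example.to_csv(buffer, sep='\t', index=False, lineterminator='\n')

# Read the data back, keeping the IDs as text
buffer.seek(0)
df_round_trip = pd.read_csv(buffer, sep='\t', dtype={'author_id': str})

# Compare the text before and after the round trip
print(df_round_trip.loc[0, 'text'] == df_example.loc[0, 'text'])
```

Tools that split the file on raw line breaks without honouring the quotes would see such a tweet as several records, so a quote-aware reader is the safer choice.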