<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 1_2 - Andre

This solution calculates the figures that describe the design of the Target Corpus:

- The `Year` when the tweets were posted;
- The `Quantity of tweets per year`;
- The `Quantity of lemmas per year`.

## Required Python packages

- pandas

## Importing the required libraries

In [1]:
import pandas as pd

## Defining input variables

In [2]:
input_directory = 'sas'

## Importing `sas/wcount.txt` into a DataFrame

In [3]:
# Defining the column names
column_names = ['Text ID', 'Quantity of Words']

# Importing the file
df_wcount = pd.read_csv(f"{input_directory}/wcount.txt", sep=' ', header=None, names=column_names)

In [4]:
df_wcount

|  | Text ID | Quantity of Words |
|---|---|---|
| 0 | t000000 | 30 |
| 1 | t000001 | 11 |
| 2 | t000002 | 24 |
| 3 | t000003 | 21 |
| 4 | t000004 | 20 |
| ... | ... | ... |
| 2078 | t002078 | 24 |
| 2079 | t002079 | 12 |
| 2080 | t002080 | 14 |
| 2081 | t002081 | 15 |


In [5]:
df_wcount.dtypes

Text ID              object
Quantity of Words     int64
dtype: object
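
The import above implies a file with one text per line, two space-separated fields, and no header row. The following is a minimal sketch of that assumed layout, reading a made-up sample from memory instead of from `sas/wcount.txt`; the sample values and the name `df_sample` are illustrative only, and the explicit `dtype` argument is an optional addition, not part of the original solution.

```python
import io

import pandas as pd

# Hypothetical sample in the assumed layout: "<Text ID> <word count>" per line
sample = io.StringIO("t000000 30\nt000001 11\nt000002 24\n")

df_sample = pd.read_csv(
    sample,
    sep=' ',
    header=None,
    names=['Text ID', 'Quantity of Words'],
    # Stating the expected types makes a malformed count fail at import time
    dtype={'Text ID': str, 'Quantity of Words': 'int64'},
)

print(df_sample.dtypes)
```

With the types stated up front, a non-numeric value in the second field raises an error during the import rather than producing an `object` column that goes unnoticed.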

### Counting zero and NaN values in each column

In [6]:
# Counting, per column, the cells that are equal to zero
zero_count = (df_wcount == 0).sum()

# Counting, per column, the cells that are NaN (missing)
nan_count = df_wcount.isna().sum()

print(f"Rows where the columns are equal to zero: \n{zero_count}")
print(f"Rows where the columns are equal to NaN: \n{nan_count}")

Rows where the columns are equal to zero: 
Text ID              0
Quantity of Words    0
dtype: int64
Rows where the columns are equal to NaN: 
Text ID              0
Quantity of Words    0
dtype: int64
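
Because the check above finds nothing in the real data, it may help to see what it reports when a problem does exist. This is a small sketch on made-up data (the DataFrame `df_toy` and its values are hypothetical) with one zero and one missing value planted in it.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the corpus files)
df_toy = pd.DataFrame({
    'Text ID': ['t000000', 't000001', 't000002'],
    'Quantity of Words': [30, 0, float('nan')],
})

# Comparing with 0 gives a boolean frame; summing counts the True cells per column
zero_count = (df_toy == 0).sum()

# isna() flags the missing cells; summing counts them per column
nan_count = df_toy.isna().sum()

print(zero_count)
print(nan_count)
```

Both results are Series indexed by column name, so each count refers to cells in that column, not to whole rows.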


## Importing `sas/dates.txt` into a DataFrame

In [7]:
# Defining the column names
column_names = ['Text ID', 'Year', 'Month', 'Day']

# Importing the file
df_dates = pd.read_csv(f"{input_directory}/dates.txt", sep=' ', header=None, names=column_names)

In [8]:
df_dates

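|  | Text ID | Year | Month | Day |
|---|---|---|---|---|
| 0 | t000000 | 2022 | 11 | 25 |
| 1 | t000001 | 2022 | 11 | 25 |
| 2 | t000002 | 2022 | 11 | 24 |
| 3 | t000003 | 2022 | 11 | 23 |
| 4 | t000004 | 2022 | 11 | 23 |
| ... | ... | ... | ... | ... |
| 2078 | t002078 | 2021 | 7 | 14 |
| 2079 | t002079 | 2021 | 7 | 14 |
| 2080 | t002080 | 2021 | 7 | 14 |
| 2081 | t002081 | 2021 | 7 | 13 |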


In [9]:
df_dates.dtypes

Text ID    object
Year        int64
Month       int64
Day         int64
dtype: object

### Counting zero and NaN values in each column

In [10]:
# Counting, per column, the cells that are equal to zero
zero_count = (df_dates == 0).sum()

# Counting, per column, the cells that are NaN (missing)
nan_count = df_dates.isna().sum()

print(f"Rows where the columns are equal to zero: \n{zero_count}")
print(f"Rows where the columns are equal to NaN: \n{nan_count}")

Rows where the columns are equal to zero: 
Text ID    0
Year       0
Month      0
Day        0
dtype: int64
Rows where the columns are equal to NaN: 
Text ID    0
Year       0
Month      0
Day        0
dtype: int64


## Combining the DataFrames `df_dates` and `df_wcount` into `df_corpus_design`

In [11]:
# Merging the DataFrames on 'Text ID'
merged_df = pd.merge(df_dates, df_wcount, on='Text ID')

# Grouping by 'Year' to get 'Quantity of Texts' and 'Quantity of Words'
df_corpus_design = merged_df.groupby('Year').agg({
    'Text ID': 'count',
    'Quantity of Words': 'sum'
}).reset_index()

# Renaming columns to match the requirements (Portuguese labels:
# Ano = Year, Tuítes = Tweets, Formas lexicais = Word forms)
df_corpus_design.rename(columns={
    'Year': 'Ano',
    'Text ID': 'Tuítes',
    'Quantity of Words': 'Formas lexicais'
}, inplace=True)

In [12]:
df_corpus_design

|  | Ano | Tuítes | Formas lexicais |
|---|---|---|---|
| 0 | 2011 | 1 | 8 |
| 1 | 2015 | 1 | 13 |
| 2 | 2018 | 11 | 319 |
| 3 | 2019 | 16 | 463 |
| 4 | 2020 | 141 | 3509 |
| 5 | 2021 | 951 | 19581 |
| 6 | 2022 | 962 | 21560 |
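
The per-year figures can be cross-checked without pandas. The sketch below repeats the same join-and-aggregate logic using only the standard library, on a few made-up rows in the layout of the two input files; the rows and the names `dates_rows`, `wcount_rows`, `words_by_id` and `design` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows in the layout of dates.txt and wcount.txt
dates_rows = ["t000000 2022 11 25", "t000001 2022 11 25", "t000002 2021 7 14"]
wcount_rows = ["t000000 30", "t000001 11", "t000002 24"]

# Map each Text ID to its word count
words_by_id = {}
for row in wcount_rows:
    text_id, quantity = row.split()
    words_by_id[text_id] = int(quantity)

# Accumulate [number of texts, number of words] per year
design = defaultdict(lambda: [0, 0])
for row in dates_rows:
    text_id, year, _month, _day = row.split()
    design[int(year)][0] += 1
    design[int(year)][1] += words_by_id[text_id]

for year in sorted(design):
    texts, words = design[year]
    print(year, texts, words)
```

Unlike the inner merge, this version raises a `KeyError` when a Text ID in the dates has no word count, which makes a mismatch between the two files visible.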


In [13]:
total_texts = df_corpus_design['Tuítes'].sum()
total_words = df_corpus_design['Formas lexicais'].sum()

In [14]:
# Portuguese for: "The total number of tweets is ... and the total number of word forms is ..."
print(f"O total de tuítes é {total_texts} e o total de formas lexicais é {total_words}.")

O total de tuítes é 2083 e o total de formas lexicais é 45453.
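
`pd.merge` performs an inner join by default, so a Text ID present in only one of the two files would be left out of these totals without any warning. Here is a minimal sketch, on made-up data, of one way to detect that situation; the toy DataFrames and the names `checked` and `unmatched` are hypothetical and are not part of the original solution.

```python
import pandas as pd

# Toy data (hypothetical): 't000002' has a date but no word count
df_dates_toy = pd.DataFrame({
    'Text ID': ['t000000', 't000001', 't000002'],
    'Year': [2022, 2022, 2021],
})
df_wcount_toy = pd.DataFrame({
    'Text ID': ['t000000', 't000001'],
    'Quantity of Words': [30, 11],
})

# how='outer' keeps every Text ID; indicator=True adds a '_merge' column
# with 'both', 'left_only' or 'right_only'; validate='one_to_one' raises
# an error if a Text ID is duplicated on either side
checked = pd.merge(
    df_dates_toy, df_wcount_toy,
    on='Text ID', how='outer', indicator=True, validate='one_to_one',
)

# Rows that did not find a partner in the other DataFrame
unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['Text ID', '_merge']])
```

An empty `unmatched` frame would mean the inner merge lost nothing, so the totals cover every text in both files.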


## Appendices

`df_tweets` turned out not to be needed for the solution. The steps for importing it into a DataFrame are kept here for future reference.

### Defining input variables

In [15]:
input_directory1 = 'tweets'

### Importing `tweets/tweets.txt` into a DataFrame

In [16]:
# Defining the column names
column_names = ['Text ID', 'Conversation', 'Date', 'User', 'Content']

# Importing the file
df_tweets = pd.read_csv(f"{input_directory1}/tweets.txt", sep='|', header=None, names=column_names)

### Counting zero and NaN values in each column

In [17]:
# Counting, per column, the cells that are equal to zero
zero_count = (df_tweets == 0).sum()

# Counting, per column, the cells that are NaN (missing)
nan_count = df_tweets.isna().sum()

print(f"Rows where the columns are equal to zero: \n{zero_count}")
print(f"Rows where the columns are equal to NaN: \n{nan_count}")

Rows where the columns are equal to zero: 
Text ID         0
Conversation    0
Date            0
User            0
Content         0
dtype: int64
Rows where the columns are equal to NaN: 
Text ID         0
Conversation    0
Date            0
User            0
Content         0
dtype: int64


### Extracting the `Year` column from the `Date` column

In [18]:
# Extracting the four-digit year from the 'd:YYYY-MM-DD' string with a regular expression
df_tweets['Year'] = df_tweets['Date'].str.extract(r'^d:(\d{4})-\d{2}-\d{2}')

In [19]:
df_tweets

|  | Text ID | Conversation | Date | User | Content | Year |
|---|---|---|---|---|---|---|
| 0 | t000000 | v:1587172706306473989 | d:2022-11-25 | u:Luciane32890118 | c:@leandroruschel Não , pois nossos direi... | 2022 |
| 1 | t000001 | v:820617853 | d:2022-11-25 | u:deqqlugar | c:Retroceder , jamais ! HASHTAG #eleicoes... | 2022 |
| 2 | t000002 | v:1151289025694699521 | d:2022-11-24 | u:Erikalrb86 | c:A dit^dura da toga continua . . . HA... | 2022 |
| 3 | t000003 | v:4128607853 | d:2022-11-23 | u:RaphaelRossiter | c:Queremos respostas , queremos transparên... | 2022 |
| 4 | t000004 | v:4128607853 | d:2022-11-23 | u:RaphaelRossiter | c:@estadaoverifica @TSEjusbr Queremos respo... | 2022 |
| ... | ... | ... | ... | ... | ... | ... |
| 2078 | t002078 | v:1256657036827295744 | d:2021-07-14 | u:Norma25714352 | c:@DeputadoFederal Uhuu tomara que seja o... | 2021 |
| 2079 | t002079 | v:1256657036827295744 | d:2021-07-14 | u:Norma25714352 | c:@ClaraFernanndez HASHTAG #danielsilveirapr... | 2021 |
| 2080 | t002080 | v:1256657036827295744 | d:2021-07-14 | u:Norma25714352 | c:@flferronato Uma vergonha HASHTAG #danie... | 2021 |
| 2081 | t002081 | v:1256657036827295744 | d:2021-07-13 | u:Norma25714352 | c:@OmarAzizSenador HASHTAG #danielsilveirapr... | 2021 |


In [20]:
df_tweets.dtypes

Text ID         object
Conversation    object
Date            object
User            object
Content         object
Year            object
dtype: object
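
Note that `Year` is an `object` column here, because `str.extract` returns strings, whereas `Year` in `df_dates` is `int64`. If the two were ever compared or merged, the types would need to agree. This is a minimal sketch of the conversion on made-up data; `df_toy` and `year_text` are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical) in the 'd:YYYY-MM-DD' layout shown above
df_toy = pd.DataFrame({'Date': ['d:2022-11-25', 'd:2021-07-14']})

# expand=False returns a Series when the pattern has a single capture group
year_text = df_toy['Date'].str.extract(r'^d:(\d{4})-\d{2}-\d{2}', expand=False)

# The extracted values are strings; convert them to integers
df_toy['Year'] = year_text.astype('int64')

print(df_toy.dtypes)
```

The conversion assumes every `Date` value matches the pattern. A non-matching value would yield a missing year, and `astype('int64')` would then raise an error, so the NaN count is worth checking first.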