<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 1_3 - Querem

This solution calculates the figures that describe the design of the Target Corpus:

- The `Year` when the tweets were posted;
- The `Quantity of tweets per year`;
- The `Quantity of lemmas per year`.

## Required Python packages

- pandas

## Importing the required libraries

In [1]:
import pandas as pd

## Defining input variables

In [2]:
input_directory = 'sas'

## Importing `sas/wcount.txt` into a DataFrame

In [3]:
# Defining the column names
column_names = ['Text ID', 'Quantity of Words']

# Importing the file
df_wcount = pd.read_csv(f"{input_directory}/wcount.txt", sep=' ', header=None, names=column_names)

In [4]:
df_wcount

| | Text ID | Quantity of Words |
|---|---|---|
| 0 | t000000 | 17 |
| 1 | t000001 | 12 |
| 2 | t000002 | 16 |
| 3 | t000003 | 22 |
| 4 | t000004 | 18 |
| ... | ... | ... |
| 19060 | t019060 | 21 |
| 19061 | t019061 | 22 |
| 19062 | t019062 | 23 |
| 19063 | t019063 | 14 |


In [5]:
df_wcount.dtypes

Text ID              object
Quantity of Words     int64
dtype: object
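
Because `Text ID` later serves as the merge key, it can help to confirm that the IDs are unique and well formed before going further. The sketch below is only an illustration on hypothetical sample rows; the pattern `t` plus six digits is an assumption inferred from the IDs displayed above, not a documented format.

```python
import pandas as pd

# Hypothetical sample (not corpus data): the third ID is malformed
# and 't000001' appears twice
sample = pd.DataFrame({
    "Text ID": ["t000000", "t000001", "x12", "t000001"],
    "Quantity of Words": [17, 12, 16, 22],
})

# Assumed ID pattern: the letter 't' followed by exactly six digits
well_formed = sample["Text ID"].str.fullmatch(r"t\d{6}")

# keep=False flags every occurrence of a repeated ID, not only the later ones
duplicated = sample["Text ID"].duplicated(keep=False)

malformed_count = int((~well_formed).sum())
duplicated_count = int(duplicated.sum())

print(f"Malformed IDs: {malformed_count}")
print(f"Duplicated IDs: {duplicated_count}")
```

Applied to `df_wcount`, both counts would be expected to be zero if every text has exactly one word count.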

### Counting zero and NaN values in each column

In [6]:
# Counting rows where the columns are equal to zero
zero_count = (df_wcount == 0).sum()

# Counting rows where the columns are equal to NaN
nan_count = df_wcount.isna().sum()

print(f"Rows where the columns are equal to zero: \n{zero_count}")
print(f"Rows where the columns are equal to NaN: \n{nan_count}")

Rows where the columns are equal to zero: 
Text ID              0
Quantity of Words    0
dtype: int64
Rows where the columns are equal to NaN: 
Text ID              0
Quantity of Words    0
dtype: int64
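
The same zero/NaN check is repeated for each DataFrame in this notebook, so it could be wrapped in a small helper. The function name `count_zeros_and_nans` and the sample data are hypothetical; this is a sketch of one possible refactoring, not part of the original solution.

```python
import pandas as pd


def count_zeros_and_nans(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column of df with its count of zeros and NaNs."""
    return pd.DataFrame({
        "zeros": (df == 0).sum(),
        "nans": df.isna().sum(),
    })


# Hypothetical sample (not corpus data) with one zero and one missing value
sample = pd.DataFrame({
    "Text ID": ["t000000", "t000001", "t000002"],
    "Quantity of Words": [17, 0, None],
})

summary = count_zeros_and_nans(sample)
print(summary)
```

Calling `count_zeros_and_nans(df_wcount)` would report both counts side by side in a single table.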


## Importing `sas/dates.txt` into a DataFrame

In [7]:
# Defining the column names
column_names = ['Text ID', 'Year', 'Month', 'Day']

# Importing the file
df_dates = pd.read_csv(f"{input_directory}/dates.txt", sep=' ', header=None, names=column_names)

In [8]:
df_dates

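| | Text ID | Year | Month | Day |
|---|---|---|---|---|
| 0 | t000000 | 2016 | 10 | 13 |
| 1 | t000001 | 2016 | 10 | 13 |
| 2 | t000002 | 2016 | 10 | 25 |
| 3 | t000003 | 2016 | 10 | 13 |
| 4 | t000004 | 2016 | 10 | 14 |
| ... | ... | ... | ... | ... |
| 19060 | t019060 | 2022 | 9 | 20 |
| 19061 | t019061 | 2022 | 9 | 5 |
| 19062 | t019062 | 2022 | 9 | 11 |
| 19063 | t019063 | 2022 | 9 | 14 |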


In [9]:
df_dates.dtypes

Text ID    object
Year        int64
Month       int64
Day         int64
dtype: object

### Counting zero and NaN values in each column

In [10]:
# Counting rows where the columns are equal to zero
zero_count = (df_dates == 0).sum()

# Counting rows where the columns are equal to NaN
nan_count = df_dates.isna().sum()

print(f"Rows where the columns are equal to zero: \n{zero_count}")
print(f"Rows where the columns are equal to NaN: \n{nan_count}")

Rows where the columns are equal to zero: 
Text ID    0
Year       0
Month      0
Day        0
dtype: int64
Rows where the columns are equal to NaN: 
Text ID    0
Year       0
Month      0
Day        0
dtype: int64
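
The zero/NaN check confirms that no date component is missing, but it does not show whether each `Year`/`Month`/`Day` combination is a real calendar date. One way to test that, sketched below on hypothetical rows, is to let `pd.to_datetime` assemble the three columns and look for values it cannot parse.

```python
import pandas as pd

# Hypothetical sample (not corpus data); the last row is deliberately
# invalid because 30 February does not exist
sample_dates = pd.DataFrame({
    "Text ID": ["t000000", "t000001", "t000002"],
    "Year": [2016, 2022, 2020],
    "Month": [10, 9, 2],
    "Day": [13, 20, 30],
})

# pd.to_datetime can build dates from columns named year, month and day
parts = sample_dates[["Year", "Month", "Day"]].rename(columns=str.lower)
parsed = pd.to_datetime(parts, errors="coerce")

# With errors="coerce", combinations that are not valid dates become NaT
invalid_rows = sample_dates[parsed.isna()]
print(invalid_rows)
```

An empty `invalid_rows` for `df_dates` would indicate that every row holds a valid date.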


## Combining DataFrames `df_dates` and `df_wcount` into DataFrame `df_corpus_design`

In [11]:
# Merging the DataFrames on 'Text ID'
merged_df = pd.merge(df_dates, df_wcount, on='Text ID')

# Grouping by 'Year' to get 'Quantity of Texts' and 'Quantity of Words'
df_corpus_design = merged_df.groupby('Year').agg({
    'Text ID': 'count',
    'Quantity of Words': 'sum'
}).reset_index()

# Renaming columns to match the requirements (Portuguese labels:
# 'Ano' = year, 'Tuítes' = tweets, 'Formas lexicais' = lexical forms)
df_corpus_design = df_corpus_design.rename(columns={
    'Year': 'Ano',
    'Text ID': 'Tuítes',
    'Quantity of Words': 'Formas lexicais'
})

In [12]:
df_corpus_design

| | Ano | Tuítes | Formas lexicais |
|---|---|---|---|
| 0 | 2016 | 1802 | 28730 |
| 1 | 2017 | 2189 | 35696 |
| 2 | 2018 | 2799 | 49045 |
| 3 | 2019 | 5193 | 91684 |
| 4 | 2020 | 2146 | 37970 |
| 5 | 2021 | 1677 | 29501 |
| 6 | 2022 | 3259 | 58330 |
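
`pd.merge` performs an inner join by default, so a `Text ID` present in only one of the two files would be dropped without any warning. The sketch below, on hypothetical sample frames, shows one way to surface such cases: an outer merge with `indicator=True`, plus `validate="one_to_one"` to raise an error if an ID is duplicated on either side.

```python
import pandas as pd

# Hypothetical sample frames (not corpus data):
# 't000002' has no word count and 't000009' has no date
dates = pd.DataFrame({
    "Text ID": ["t000000", "t000001", "t000002"],
    "Year": [2016, 2016, 2017],
})
wcount = pd.DataFrame({
    "Text ID": ["t000000", "t000001", "t000009"],
    "Quantity of Words": [17, 12, 20],
})

# indicator=True adds a '_merge' column saying which side each row came from;
# validate="one_to_one" raises MergeError if a Text ID is repeated on either side
checked = pd.merge(dates, wcount, on="Text ID", how="outer",
                   indicator=True, validate="one_to_one")

# Rows not labelled 'both' exist in only one of the two frames
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Text ID", "_merge"]])
```

If `unmatched` is empty for the real data, the inner merge used in the solution loses no texts.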


In [13]:
total_texts = df_corpus_design['Tuítes'].sum()
total_words = df_corpus_design['Formas lexicais'].sum()

In [14]:
print(f"The total number of tweets is {total_texts} and the total number of lexical forms is {total_words}.")

The total number of tweets is 19065 and the total number of lexical forms is 330956.
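
As an alternative to grouping first and renaming afterwards, pandas named aggregation sets the output column names in the `agg` call itself. The sketch below uses hypothetical sample data and hypothetical column names (`tweets`, `words`), and adds a consistency check that the per-year figures add up to the totals of the merged frame.

```python
import pandas as pd

# Hypothetical merged sample (not corpus data)
merged = pd.DataFrame({
    "Text ID": ["t000000", "t000001", "t000002"],
    "Year": [2016, 2016, 2017],
    "Quantity of Words": [17, 12, 16],
})

# Named aggregation: each keyword is an output column,
# each value is a (source column, aggregation function) pair
design = merged.groupby("Year", as_index=False).agg(
    tweets=("Text ID", "count"),
    words=("Quantity of Words", "sum"),
)

# The yearly figures should add up to the totals of the merged frame
assert design["tweets"].sum() == len(merged)
assert design["words"].sum() == merged["Quantity of Words"].sum()

print(design)
```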


## Appendices

`df_tweets` turned out not to be needed for the solution. The steps for importing it into a DataFrame are kept here for future reference.

### Defining input variables

In [15]:
input_directory1 = 'tweets'

### Importing `tweets/tweets.txt` into a DataFrame

In [16]:
# Defining the column names
column_names = ['Text ID', 'Conversation', 'Date', 'User', 'Content']

# Importing the file
df_tweets = pd.read_csv(f"{input_directory1}/tweets.txt", sep='|', header=None, names=column_names)

### Counting zero and NaN values in each column

In [17]:
# Counting rows where the columns are equal to zero
zero_count = (df_tweets == 0).sum()

# Counting rows where the columns are equal to NaN
nan_count = df_tweets.isna().sum()

print(f"Rows where the columns are equal to zero: \n{zero_count}")
print(f"Rows where the columns are equal to NaN: \n{nan_count}")

Rows where the columns are equal to zero: 
Text ID         0
Conversation    0
Date            0
User            0
Content         0
dtype: int64
Rows where the columns are equal to NaN: 
Text ID         0
Conversation    0
Date            0
User            0
Content         0
dtype: int64


### Extracting the `Year` column from the `Date` column

In [18]:
# Extracting the four-digit year from values such as 'd:2016-10-13' with a regular expression
df_tweets['Year'] = df_tweets['Date'].str.extract(r'^d:(\d{4})-\d{2}-\d{2}')

In [19]:
df_tweets

| | Text ID | Conversation | Date | User | Content | Year |
|---|---|---|---|---|---|---|
| 0 | t000000 | v:2316329808 | d:2016-10-13 | u:SrtaXiss | c:RT @BR_DeTodos200Mi: Grupo distribui com... | 2016 |
| 1 | t000001 | v:4876348647 | d:2016-10-13 | u:ins_ana_ | c:RT @correio_dopovo: Roraima prepara gabi... | 2016 |
| 2 | t000002 | v:2858025838 | d:2016-10-25 | u:ireneravachero3 | c:E quanto a situação dos Venezuelanos ,... | 2016 |
| 3 | t000003 | v:457243275 | d:2016-10-13 | u:vitor_CRVG | c:@Estadao queria saber se o diretório d... | 2016 |
| 4 | t000004 | v:1944741320 | d:2016-10-14 | u:Camisa13doGalo | c:Rómulo Otero celebra resultado e boa a... | 2016 |
| ... | ... | ... | ... | ... | ... | ... |
| 19060 | t019060 | v:1506118493938270208 | d:2022-09-20 | u:Jmalvesdc | c:RT @BoicaIslene: Victor Lucchesi , expõ... | 2022 |
| 19061 | t019061 | v:2904143747 | d:2022-09-05 | u:cjcastro45 | c:RT @Pattypschmidt: Hoje conheci Denisse ... | 2022 |
| 19062 | t019062 | v:1547288191731900416 | d:2022-09-11 | u:LimaFucuta | c:@VEJA Cruz credo , o regime que o pt... | 2022 |
| 19063 | t019063 | v:123496655 | d:2022-09-14 | u:EMBRAC | c:RT @DiarioPE: Justiça argentina autoriza... | 2022 |


In [20]:
df_tweets.dtypes

Text ID         object
Conversation    object
Date            object
User            object
Content         object
Year            object
dtype: object
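
The extracted `Year` column has dtype `object` because `str.extract` returns strings, whereas `Year` in `df_dates` is `int64`. If the two were ever compared or merged, the strings would need converting first. The sketch below shows one possible conversion on hypothetical sample values; the malformed third value is included only to show how missing years are handled.

```python
import pandas as pd

# Hypothetical sample (not corpus data) in the same 'd:YYYY-MM-DD' format;
# the last value is deliberately malformed
tweets = pd.DataFrame({"Date": ["d:2016-10-13", "d:2022-09-20", "d:unknown"]})

# expand=False returns a Series rather than a one-column DataFrame
year_text = tweets["Date"].str.extract(r"^d:(\d{4})-\d{2}-\d{2}", expand=False)

# The nullable Int64 dtype keeps unparsed years as <NA> instead of forcing floats
tweets["Year"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")

print(tweets.dtypes)
```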