<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 1_3 - Querem

This solution calculates the figures that describe the design of the Target Corpus:

- The `Year` when the tweets were posted;
- The `Quantity of tweets per year`;
- The `Quantity of lemmas per year`.

## Required Python packages

- pandas

## Importing the required libraries

In [1]:
import pandas as pd

## Defining input variables

In [2]:
input_directory = 'sas'

## Importing `sas/wcount.txt` into a DataFrame

In [3]:
# Defining the column names
column_names = ['Text ID', 'Quantity of Words']

# Importing the file
df_wcount = pd.read_csv(f"{input_directory}/wcount.txt", sep=' ', header=None, names=column_names)

In [4]:
df_wcount

| | Text ID | Quantity of Words |
|---|---|---|
| 0 | t000000 | 47 |
| 1 | t000001 | 37 |
| 2 | t000002 | 42 |
| 3 | t000003 | 43 |
| 4 | t000004 | 22 |
| ... | ... | ... |
| 20596 | t020596 | 23 |
| 20597 | t020597 | 51 |
| 20598 | t020598 | 51 |
| 20599 | t020599 | 67 |


In [5]:
df_wcount.dtypes

Text ID              object
Quantity of Words     int64
dtype: object
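
Because `Text ID` is later used as the merge key, it can be worth confirming that the identifiers are unique and well formed. The following is a minimal sketch on hypothetical toy data; the `t` plus six digits pattern is an assumption inferred from the rows displayed above, and the names `df_example`, `duplicated_ids` and `malformed_ids` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, not taken from the corpus
df_example = pd.DataFrame({
    "Text ID": ["t000000", "t000001", "t000001", "x12"],
    "Quantity of Words": [47, 37, 42, 43],
})

# Identifiers that occur more than once (all occurrences are kept)
duplicated_ids = df_example.loc[
    df_example["Text ID"].duplicated(keep=False), "Text ID"
]

# Identifiers that do not match the assumed pattern: 't' followed by six digits
malformed_ids = df_example.loc[
    ~df_example["Text ID"].str.fullmatch(r"t\d{6}"), "Text ID"
]

print(f"Duplicated identifiers: {duplicated_ids.tolist()}")
print(f"Malformed identifiers: {malformed_ids.tolist()}")
```

On the real data, both lists would be expected to be empty if the identifiers are clean.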

### Counting the zero and NaN values in each column

In [6]:
# Counting, per column, the values equal to zero
zero_count = (df_wcount == 0).sum()

# Counting, per column, the missing (NaN) values
nan_count = df_wcount.isna().sum()

print(f"Rows where the columns are equal to zero: \n{zero_count}")
print(f"Rows where the columns are equal to NaN: \n{nan_count}")

Rows where the columns are equal to zero: 
Text ID              0
Quantity of Words    0
dtype: int64
Rows where the columns are equal to NaN: 
Text ID              0
Quantity of Words    0
dtype: int64
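
Since the same zero/NaN check is repeated for each DataFrame in this notebook, it could be wrapped in a small helper. This is only a sketch: `count_zeros_and_nans` is a hypothetical function name, and the toy data below is made up for illustration.

```python
import pandas as pd


def count_zeros_and_nans(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its number of zero and NaN values."""
    return pd.DataFrame({
        "zeros": (df == 0).sum(),
        "nans": df.isna().sum(),
    })


# Hypothetical toy data, not taken from the corpus
df_example = pd.DataFrame({
    "Text ID": ["t000000", "t000001", "t000002"],
    "Quantity of Words": [47, 0, None],
})

summary = count_zeros_and_nans(df_example)
print(summary)
```

Returning a single DataFrame places both counts side by side, which may be easier to scan than two separate printed Series.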


## Importing `sas/dates.txt` into a DataFrame

In [7]:
# Defining the column names
column_names = ['Text ID', 'Year', 'Month', 'Day']

# Importing the file
df_dates = pd.read_csv(f"{input_directory}/dates.txt", sep=' ', header=None, names=column_names)

In [8]:
df_dates

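| | Text ID | Year | Month | Day |
|---|---|---|---|---|
| 0 | t000000 | 2018 | 3 | 28 |
| 1 | t000001 | 2018 | 3 | 30 |
| 2 | t000002 | 2018 | 3 | 30 |
| 3 | t000003 | 2018 | 3 | 28 |
| 4 | t000004 | 2018 | 3 | 30 |
| ... | ... | ... | ... | ... |
| 20596 | t020596 | 2023 | 4 | 29 |
| 20597 | t020597 | 2023 | 4 | 29 |
| 20598 | t020598 | 2023 | 4 | 29 |
| 20599 | t020599 | 2023 | 4 | 29 |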


In [9]:
df_dates.dtypes

Text ID    object
Year        int64
Month       int64
Day         int64
dtype: object

### Counting the zero and NaN values in each column

In [10]:
# Counting, per column, the values equal to zero
zero_count = (df_dates == 0).sum()

# Counting, per column, the missing (NaN) values
nan_count = df_dates.isna().sum()

print(f"Rows where the columns are equal to zero: \n{zero_count}")
print(f"Rows where the columns are equal to NaN: \n{nan_count}")

Rows where the columns are equal to zero: 
Text ID    0
Year       0
Month      0
Day        0
dtype: int64
Rows where the columns are equal to NaN: 
Text ID    0
Year       0
Month      0
Day        0
dtype: int64
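
The check above confirms that no date component is zero or missing, but it does not confirm that each `Year`/`Month`/`Day` combination is a real calendar date. One possible additional check, sketched here on hypothetical toy data, is to let `pd.to_datetime` assemble the dates and flag the rows it cannot parse. The names `df_example`, `parts`, `parsed` and `invalid_rows` are illustrative.

```python
import pandas as pd

# Hypothetical toy data; the second row (30 February) is deliberately invalid
df_example = pd.DataFrame({
    "Text ID": ["t000000", "t000001", "t000002"],
    "Year": [2018, 2018, 2023],
    "Month": [3, 2, 4],
    "Day": [28, 30, 29],
})

# pd.to_datetime can assemble dates from columns named year, month and day
parts = df_example[["Year", "Month", "Day"]].rename(columns=str.lower)

# errors="coerce" turns impossible dates into NaT instead of raising an error
parsed = pd.to_datetime(parts, errors="coerce")

invalid_rows = df_example[parsed.isna()]
print(invalid_rows)
```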


## Combining the DataFrames `df_dates` and `df_wcount` into the DataFrame `df_corpus_design`

In [11]:
# Merging the DataFrames on 'Text ID'
merged_df = pd.merge(df_dates, df_wcount, on='Text ID')

# Grouping by 'Year' to get 'Quantity of Texts' and 'Quantity of Words'
df_corpus_design = merged_df.groupby('Year').agg({
    'Text ID': 'count',
    'Quantity of Words': 'sum'
}).reset_index()

# Renaming columns to match the requirements (Portuguese labels:
# 'Ano' = year, 'Tuítes' = tweets, 'Formas lexicais' = lexical forms)
df_corpus_design.rename(columns={
    'Year': 'Ano',
    'Text ID': 'Tuítes',
    'Quantity of Words': 'Formas lexicais'
}, inplace=True)
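
By default, `pd.merge` performs an inner join, so a `Text ID` present in only one of the two DataFrames is dropped silently. A minimal sketch of how unmatched or duplicated identifiers could be detected before aggregating is shown below. It uses hypothetical toy data, and the names `df_dates_example`, `df_wcount_example`, `checked` and `unmatched` are illustrative.

```python
import pandas as pd

# Hypothetical toy data, not taken from the corpus
df_dates_example = pd.DataFrame({
    "Text ID": ["t000000", "t000001", "t000002"],
    "Year": [2018, 2018, 2023],
})
df_wcount_example = pd.DataFrame({
    "Text ID": ["t000000", "t000001", "t000003"],
    "Quantity of Words": [47, 37, 42],
})

# how="outer" keeps unmatched rows, indicator=True adds a `_merge` column,
# and validate="one_to_one" raises an error if a Text ID is duplicated
checked = pd.merge(
    df_dates_example,
    df_wcount_example,
    on="Text ID",
    how="outer",
    indicator=True,
    validate="one_to_one",
)

# Rows found in only one of the two DataFrames
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Text ID", "_merge"]])
```

If `unmatched` is empty on the real data, the inner join used in this notebook loses no texts.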

In [12]:
df_corpus_design

| | Ano | Tuítes | Formas lexicais |
|---|---|---|---|
| 0 | 2016 | 6 | 115 |
| 1 | 2017 | 5 | 131 |
| 2 | 2018 | 2641 | 77425 |
| 3 | 2019 | 3064 | 90223 |
| 4 | 2020 | 4034 | 120599 |
| 5 | 2021 | 4095 | 120869 |
| 6 | 2022 | 5358 | 156241 |
| 7 | 2023 | 1398 | 40899 |


In [13]:
total_texts = df_corpus_design['Tuítes'].sum()
total_words = df_corpus_design['Formas lexicais'].sum()

In [14]:
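# Reporting the totals in Portuguese: "The total number of tweets is ...
# and the total number of lexical forms is ..."
print(f"O total de tuítes é {total_texts} e o total de formas lexicais é {total_words}.")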

O total de tuítes é 20601 e o total de formas lexicais é 606502.
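
As a consistency check, the per-year figures should add up to the totals of the ungrouped data. The sketch below illustrates this on hypothetical toy data. It also uses pandas named aggregation as one possible alternative that groups and renames in a single step; the names `merged_example` and `design_example` are illustrative and not part of the original solution.

```python
import pandas as pd

# Hypothetical toy data standing in for the merged DataFrame
merged_example = pd.DataFrame({
    "Text ID": ["t0", "t1", "t2", "t3"],
    "Year": [2018, 2018, 2019, 2019],
    "Quantity of Words": [47, 37, 42, 43],
})

# Named aggregation: each output column is defined as (input column, function)
design_example = (
    merged_example.groupby("Year")
    .agg(**{
        "Tuítes": ("Text ID", "count"),
        "Formas lexicais": ("Quantity of Words", "sum"),
    })
    .reset_index()
    .rename(columns={"Year": "Ano"})
)

# The per-year figures must add up to the totals of the ungrouped data
assert design_example["Tuítes"].sum() == len(merged_example)
assert design_example["Formas lexicais"].sum() == merged_example["Quantity of Words"].sum()

print(design_example)
```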


## Appendices

`df_tweets` turned out not to be needed for this solution. The steps for importing it into a DataFrame are kept here for future reference.

### Defining input variables

In [15]:
input_directory1 = 'tweets'

### Importing `tweets/tweets.txt` into a DataFrame

In [16]:
# Defining the column names
column_names = ['Text ID', 'Conversation', 'Date', 'User', 'Content']

# Importing the file
df_tweets = pd.read_csv(f"{input_directory1}/tweets.txt", sep='|', header=None, names=column_names)

### Counting the zero and NaN values in each column

In [17]:
# Counting, per column, the values equal to zero
zero_count = (df_tweets == 0).sum()

# Counting, per column, the missing (NaN) values
nan_count = df_tweets.isna().sum()

print(f"Rows where the columns are equal to zero: \n{zero_count}")
print(f"Rows where the columns are equal to NaN: \n{nan_count}")

Rows where the columns are equal to zero: 
Text ID         0
Conversation    0
Date            0
User            0
Content         0
dtype: int64
Rows where the columns are equal to NaN: 
Text ID         0
Conversation    0
Date            0
User            0
Content         0
dtype: int64


### Extracting the `Year` column from the `Date` column

In [18]:
# Extracting the four-digit year with a regular expression (the result is a column of strings)
df_tweets['Year'] = df_tweets['Date'].str.extract(r'^d:(\d{4})-\d{2}-\d{2}')

In [19]:
df_tweets

| | Text ID | Conversation | Date | User | Content | Year |
|---|---|---|---|---|---|---|
| 0 | t000000 | v:287765295 | d:2018-03-28 | u:pelegrini65 | c:Após caluniar , ameaçar , incitar as ... | 2018 |
| 1 | t000001 | v:16794066 | d:2018-03-30 | u:BlogdoNoblat | c:Bolsonaro deve saber o que está fazend... | 2018 |
| 2 | t000002 | v:955901617148235776 | d:2018-03-30 | u:MariaOl25529153 | c:@FlavioBolsonaro Mais um Romário na pol... | 2018 |
| 3 | t000003 | v:44449830 | d:2018-03-28 | u:lucianagenro | c:A esquerda não tem conseguido comunicar... | 2018 |
| 4 | t000004 | v:912132396 | d:2018-03-30 | u:rocoguima | c:RT @AurystellaS: @BlogdoNoblat Vc sabe ... | 2018 |
| ... | ... | ... | ... | ... | ... | ... |
| 20596 | t020596 | v:1547227306913153026 | d:2023-04-29 | u:LuccaSo44679209 | c:RT @LuccaSo44679209: @CiresCanisio Não ,... | 2023 |
| 20597 | t020597 | v:1547227306913153026 | d:2023-04-29 | u:LuccaSo44679209 | c:@CiresCanisio Não , o Lula não da uma... | 2023 |
| 20598 | t020598 | v:1554492869825683457 | d:2023-04-29 | u:Andre19lll | c:@eunaovoupararde @CarlosZarattini Os índi... | 2023 |
| 20599 | t020599 | v:1585200142440882179 | d:2023-04-29 | u:priscila19865 | c:@ValS265451870 @Guthbsb @marcia_miami Tá ... | 2023 |


In [20]:
df_tweets.dtypes

Text ID         object
Conversation    object
Date            object
User            object
Content         object
Year            object
dtype: object
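
As the `dtypes` output shows, the extracted `Year` column has dtype `object` because `str.extract` returns strings. If the column were to be used for grouping or numeric comparison, it could be converted to integers. The following is a minimal sketch on hypothetical toy data, assuming the same `d:YYYY-MM-DD` format; the names `df_example` and `year_text` are illustrative.

```python
import pandas as pd

# Hypothetical toy data; the last value deliberately does not match the pattern
df_example = pd.DataFrame({
    "Date": ["d:2018-03-28", "d:2023-04-29", "malformed"],
})

# expand=False returns a Series; rows that do not match become NaN
year_text = df_example["Date"].str.extract(
    r"^d:(\d{4})-\d{2}-\d{2}", expand=False
)

# The nullable Int64 dtype stores integers alongside missing values
df_example["Year"] = pd.to_numeric(year_text).astype("Int64")

print(df_example.dtypes)
print(df_example)
```

The nullable `Int64` dtype is used here so that rows whose date does not match the pattern remain missing instead of forcing the column to floating point.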