# Week 9: A Pandas Approach to TTRs in the Colonial South Asian Literature dataset


Over the previous two weeks we learned how to work with **metadata** in Pandas. We didn't look at any actual *text* of literary works, just at data **about** texts. This week we learn how to record findings about textual data in Pandas by storing them in DataFrames (including how to turn lists into DataFrames, and how to merge DataFrames). Our example of textual analysis is a familiar one: type-token ratios (TTRs).

Then we combine the two approaches. We load the Colonial South Asian Literature (CSAL) dataset (metadata), and then analyze each text represented in the metadata, calculating its overall and standardized TTR. We record these values, then merge them with a dataframe containing the CSAL metadata.

Then we use Pandas to sort, process, and visualize our data, asking whether South Asian and foreign writers had notably divergent TTRs in their discussions of South Asia.

# Loading the CSAL Dataset

Let's begin by loading the CSAL dataset and having a look at what kinds of "metadata" it contains.

In [None]:
import pandas as pd

In [None]:
csal_meta_df = pd.read_csv('csal.csv')

In [None]:
csal_meta_df

In [None]:
csal_meta_df.head(10)

In [None]:
csal_meta_df.info()

In [None]:
csal_meta_df.describe(include="all")

Our task today is to investigate whether South Asian or "foreign" writers have higher TTRs in their works. So the most important column for us at this point is `Nationality of Author`. Let's have a closer look at what it contains.

In [None]:
csal_meta_df['Nationality of Author'].value_counts()

In [None]:
csal_meta_df['Nationality of Author'].value_counts().plot(kind="pie", figsize=(7, 7))
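
If proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. The sketch below uses a small made-up Series (the labels are hypothetical, not taken from CSAL) to show the difference.

```python
import pandas as pd

# Hypothetical toy labels, NOT the real CSAL values
nationality = pd.Series(["A", "A", "B", "A", "C", "B"], name="Nationality of Author")

counts = nationality.value_counts()                # raw counts per label
shares = nationality.value_counts(normalize=True)  # fraction of rows per label

print(counts)
print(shares.round(2))
```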

Let's also have a look at `Genre`, which might be interesting to us in a bit as well...

In [None]:
csal_meta_df['Genre'].value_counts()

In [None]:
csal_meta_df['Genre'].value_counts().plot(kind="pie", figsize=(7, 7))

# Approaching the TTR Task in Pandas

Okay, now let's take a step back and re-approach our TTR task with a Pandas mindset. 

When we took on this task in the first half of the semester, we just wanted to output our analysis to a CSV spreadsheet file and consider our task done.

Now that we have been introduced to Pandas, let's instead plan on gathering our results into DataFrames, with the aim of eventually putting new columns into the `csal_meta_df` DataFrame with TTR values for every text: overall types, overall tokens, overall TTR; standardized types, standardized tokens, standardized TTR. Once we have that big DataFrame with those extra columns, we'll be able to do some fancy analysis!

We'll start by recycling our code from a few weeks ago to generate the same CSV files... except this time, we'll load those CSVs back into Python as Pandas dataframes!
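
For comparison, the CSV round trip is not the only route into Pandas. A minimal sketch, assuming results are collected as a list of dictionaries (the file names and numbers below are made up for illustration), shows how `pd.DataFrame()` turns such a list directly into a DataFrame:

```python
import pandas as pd

# Hypothetical results; in practice each dictionary would be built inside a loop over files
rows = [
    {"Text": "example_a.txt", "Overall Types": 50, "Overall Tokens": 200},
    {"Text": "example_b.txt", "Overall Types": 30, "Overall Tokens": 60},
]

toy_overall_df = pd.DataFrame(rows)  # one row per dictionary, one column per key

# Derived column: types divided by tokens, expressed as a percentage
toy_overall_df["Overall TTR"] = (
    toy_overall_df["Overall Types"] / toy_overall_df["Overall Tokens"] * 100
)
print(toy_overall_df)
```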


## Generating TTR CSV files... and Loading Them Back as Pandas DataFrames

Let's start by using some code directly recycled** from the Week 5 lecture. 

The code below iterates through all the files in the `csal` folder, which contains all the CSAL texts, and generates CSV files for their standardized and overall values. There are 110 files in the CSAL dataset, some of them quite large, so this will take a second!

** There are two tiny differences between this code and the code we used in Week 5. First, I have grabbed the full file name, with extension, of each text file for the `Text` column, so that it matches what's in the `Text` column of `csal_meta_df`. You'll see why that's important in a second... Second, this code now labels the columns "Overall Types" or "Standardized Types" (etc.) for more precise column labels that will also come in handy later.

### BEFORE RUNNING THE BELOW CELL: It takes a while to run this code, since the CSAL dataset is pretty big, so I have placed the output files of the below cell in your JupyterHubs for demonstration purposes! IF YOU'RE FOLLOWING ALONG IN CLASS, LET'S AVOID CRASHING THE SERVER BY SKIPPING THE BELOW STEP.

In [None]:
import re
from pathlib import Path

folder_path = "csal/" # We're telling the code to look in the "csal/" subfolder, where the CSAL files all live.

sample_size = 0

file = open("ttr-overall.csv", mode="w", encoding="utf-8")

file.write('"Text","Overall Types","Overall Tokens","Overall TTR"\n') # Column labels are more precise, identifying whether the column records "Overall" or "Standardized" values

for file_path in sorted(Path(folder_path).glob('*.txt')):
    
    text = file_path.read_text(encoding='utf-8') # read_text() opens, reads, and closes the file for us
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    
    text_words = text.split()
    tokens = len(text_words)
    
    if sample_size == 0 or tokens < sample_size:
        sample_size = tokens
    
    unique_words = []
    
    for word in text_words:
        word = word.lower()
        if word not in unique_words:
            unique_words.append(word)
            
    types = len(unique_words)
    
    ttr = (types / tokens) * 100
    
    file.write(f'"{file_path.name}",{types},{tokens},{ttr}\n') # path.name used rather than path.stem so that recorded filenames match CSAL metadata

file.close()



file = open("ttr-standardized.csv", mode="w", encoding="utf-8")

file.write('"Text","Standardized Types","Standardized Tokens","Standardized TTR"\n') # Column labels are more precise, identifying whether the column records "Overall" or "Standardized" values

for file_path in sorted(Path(folder_path).glob('*.txt')):
    text = file_path.read_text(encoding='utf-8') # read_text() opens, reads, and closes the file for us
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    
    text_words = text.split()
    text_words_standardized = text_words[:sample_size]
    tokens_standardized = len(text_words_standardized)

    unique_words_standardized = []
    
    for word in text_words_standardized:
        word = word.lower()
        if word not in unique_words_standardized:
            unique_words_standardized.append(word)
            
    types_standardized = len(unique_words_standardized)
    
    ttr_standardized = (types_standardized / tokens_standardized) * 100
    
    file.write(f'"{file_path.name}",{types_standardized},{tokens_standardized},{ttr_standardized}\n') # path.name used rather than path.stem so that recorded filenames match CSAL metadata

file.close()

Okay, that has left us with our familiar `ttr-overall.csv` and `ttr-standardized.csv` results files. 

Let's use our old friend `pd.read_csv()` to load each of those newly created CSV files as Pandas DataFrames!

In [None]:
overall_ttr_df = pd.read_csv("ttr-overall.csv")

In [None]:
overall_ttr_df

In [None]:
standardized_ttr_df = pd.read_csv("ttr-standardized.csv")

In [None]:
standardized_ttr_df
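
A side note on the cell that generated these CSV files: a likely contributor to its running time is the `if word not in unique_words` check, which scans a growing list once for every token. A Python `set` gives the same count of unique words with much faster lookups. The helper name `count_types_and_tokens` and the sample sentence are hypothetical, and this is a sketch rather than a replacement for the lecture code.

```python
import re

def count_types_and_tokens(text, sample_size=None):
    """Return (types, tokens, ttr) using the same cleaning rule as the lecture code."""
    cleaned = re.sub("[^a-zA-Z0-9]", " ", text)    # replace non-alphanumerics with spaces
    words = cleaned.split()
    if sample_size is not None:
        words = words[:sample_size]                # standardized sample: first N tokens
    tokens = len(words)
    types = len({word.lower() for word in words})  # a set keeps each word only once
    # Assumption: report 0.0 for an empty text instead of dividing by zero
    ttr = (types / tokens) * 100 if tokens else 0.0
    return types, tokens, ttr

print(count_types_and_tokens("The cat saw the other cat."))
print(count_types_and_tokens("The cat saw the other cat.", sample_size=4))
```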

# Merging DataFrames

Okay, let's say that instead of having our overall and standardized TTR values in separate DataFrames, we wanted to **merge** them into a single DataFrame that contains all the relevant data.

Well, we can do that quite easily with Pandas's `pd.merge()` function. To combine two DataFrames, Pandas needs a key that tells it which rows belong together, and the simplest key is a column the two DataFrames have in common. Thankfully, our DataFrames do have one column in common: `Text`.

In [None]:
overall_ttr_df

In [None]:
standardized_ttr_df

Below is the command we use to `.merge()` our two DataFrames, **"on"** the column they have in common. 

In [None]:
pd.merge(overall_ttr_df, standardized_ttr_df, on="Text")

Now let's go ahead and stick that into a variable.

In [None]:
ttr_df = pd.merge(overall_ttr_df, standardized_ttr_df, on="Text")

In [None]:
ttr_df
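
To see what `on="Text"` is doing, here is a small sketch with two made-up DataFrames (hypothetical file names and values). Rows are matched by the value in `Text`, not by their position, so the row order in each DataFrame does not have to agree.

```python
import pandas as pd

# Hypothetical toy data; note that the rows are in a different order in each DataFrame
left = pd.DataFrame({"Text": ["a.txt", "b.txt"], "Overall TTR": [10.0, 12.5]})
right = pd.DataFrame({"Text": ["b.txt", "a.txt"], "Standardized TTR": [40.0, 35.0]})

toy_merged = pd.merge(left, right, on="Text")  # rows are lined up by their Text value
print(toy_merged)
```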

# Merging the TTR Data with the CSAL Metadata

It's worth remembering at this time that the CSAL Metadata (currently stored in `csal_meta_df`) also contains that same `Text` column — and so we can also create a mega-merged DataFrame that contains all the CSAL metadata and all the TTR analysis we've just done. This will allow us to analyze our TTRs by our various metadata categories, including author nationality.

In [None]:
csal_ttr_df = pd.merge(csal_meta_df, ttr_df, on="Text")

In [None]:
csal_ttr_df
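
One thing worth checking after a merge like this: by default `pd.merge()` performs an inner join, so any row whose `Text` value appears in only one DataFrame is silently dropped. A hedged sketch with made-up data shows how `how="outer"` plus `indicator=True` can surface such mismatches (for example, a file name typed differently in the metadata).

```python
import pandas as pd

# Hypothetical stand-ins for the metadata and TTR DataFrames
toy_meta = pd.DataFrame({"Text": ["a.txt", "b.txt", "c.txt"],
                         "Genre": ["Novel", "Poetry", "Novel"]})
toy_ttr = pd.DataFrame({"Text": ["a.txt", "b.txt", "d.txt"],
                        "Standardized TTR": [35.0, 40.0, 38.0]})

inner = pd.merge(toy_meta, toy_ttr, on="Text")  # default is how="inner"
check = pd.merge(toy_meta, toy_ttr, on="Text", how="outer", indicator=True)

# Rows that failed to match are labeled "left_only" or "right_only" in the _merge column
unmatched = check[check["_merge"] != "both"]
print(len(inner), "rows survive the inner merge")
print(unmatched[["Text", "_merge"]])
```

On the real data, comparing `len(csal_meta_df)`, `len(ttr_df)`, and `len(csal_ttr_df)` is a quick first check that nothing went missing.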

Let's learn a little more about this new mega-DataFrame we've created...

In [None]:
csal_ttr_df.describe(include="all")

This is one of those occasions when the `include="all"` parameter on the `df.describe()` method gives us more info than we really want. Let's try again without it, which will only summarize the numeric columns (the "greatest hits")...

In [None]:
csal_ttr_df.describe()

# Sorting by Column

Before we jump into our actual task for this week, let's see how you would sort the full dataset by Standardized TTR, from lowest to highest; then from highest to lowest.

In [None]:
csal_ttr_df.sort_values(by='Standardized TTR', ascending=True)

In [None]:
csal_ttr_df.sort_values(by='Standardized TTR', ascending=False)
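
Often only the extremes are of interest. Chaining `.head()` onto a sort, or using `.nlargest()` and `.nsmallest()`, returns just the top or bottom few rows. The toy values below are made up.

```python
import pandas as pd

# Hypothetical toy data standing in for csal_ttr_df
toy_df = pd.DataFrame({
    "Text": ["a.txt", "b.txt", "c.txt", "d.txt"],
    "Standardized TTR": [35.0, 40.0, 28.5, 38.0],
})

top_two = toy_df.sort_values(by="Standardized TTR", ascending=False).head(2)
same_top_two = toy_df.nlargest(2, "Standardized TTR")  # selects the same rows here
bottom_two = toy_df.nsmallest(2, "Standardized TTR")

print(top_two)
print(bottom_two)
```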

# Using GroupBy and Mean to Get Our TTR-by-Nationality Data

Now that we have this mega-DataFrame — it contains all the CSAL metadata, and all our precious TTR data — we can pursue our original research question: do texts written by authors from the subcontinent have higher or lower TTRs than texts written by authors identified as foreign?

**What data do we actually need to see, in what format, to pursue that research question?**

Let's start by using our old friend `df.groupby()` and group this DataFrame by the `Nationality of Author` column.

In [None]:
csal_by_nationality_df = csal_ttr_df.groupby('Nationality of Author')
csal_by_nationality_df

The object produced by `.groupby()` is a GroupBy object, not an ordinary DataFrame, so it can't be displayed in the standard way that normal DataFrames are. We need to call methods on it to see what's inside. Remember what we're looking for: the **mean standardized TTR for each category of author nationality**. If we just call on old reliable `.describe()`, we can see that this data is already in the `csal_by_nationality_df` object we just produced. Do you see where it is in the below output?

In [None]:
csal_by_nationality_df.describe()

Here's how we grab only the information we want from `csal_by_nationality_df`: we subset to the `Standardized TTR` column (using a method we've been using for a few weeks now, passing a `['list containing a single string']` into the `dataframe[ ]` structure) and then call the `.mean()` method on that column.

What we get from this is just a plain old Pandas DataFrame (not a GroupBy object).

In [None]:
csal_by_nationality_df[['Standardized TTR']].mean()

In [None]:
type(csal_by_nationality_df[['Standardized TTR']].mean())
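
The double brackets matter here. A small sketch with made-up data (hypothetical group labels and values) shows that single brackets select a Series, while a list inside the brackets keeps the result as a DataFrame:

```python
import pandas as pd

# Hypothetical toy data with two made-up nationality labels
toy_df = pd.DataFrame({
    "Nationality of Author": ["X", "X", "Y", "Y"],
    "Standardized TTR": [30.0, 40.0, 20.0, 30.0],
})
grouped = toy_df.groupby("Nationality of Author")

as_series = grouped["Standardized TTR"].mean()   # single brackets give a Series
as_frame = grouped[["Standardized TTR"]].mean()  # a list in the brackets gives a DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
print(as_frame)
```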

Now let's stick that into a variable... and let's make a plot of the data we've uncovered... and then interpret the results together!

In [None]:
mean_ttr_by_nationality_df = csal_by_nationality_df[['Standardized TTR']].mean()

In [None]:
mean_ttr_by_nationality_df.plot(kind='bar', figsize=(10,5), title='Standardized TTRs Averaged Across Nationality of Author')
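
A mean on its own can hide how many texts sit behind it. As a hedged sketch with made-up numbers, `.agg()` can report the mean, median, and count for each group in one table, which helps judge whether a difference between groups rests on enough texts to be meaningful:

```python
import pandas as pd

# Hypothetical toy data: group "Y" contains only a single text
toy_df = pd.DataFrame({
    "Nationality of Author": ["X", "X", "X", "Y"],
    "Standardized TTR": [30.0, 40.0, 50.0, 20.0],
})

summary = (
    toy_df.groupby("Nationality of Author")["Standardized TTR"]
    .agg(["mean", "median", "count"])  # one column per statistic
)
print(summary)
```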

Let's now look at similar plots for TTR data grouped by different metadata categories, using the same methods employed above. Does this give you any further insight into the results above?

In [None]:
csal_ttr_by_year_df = csal_ttr_df.groupby('Year')
mean_ttr_by_year_df = csal_ttr_by_year_df[['Standardized TTR']].mean()
mean_ttr_by_year_df

In [None]:
mean_ttr_by_year_df.plot(figsize=(15,5), title='Standardized TTRs Averaged Across Year of Publication')

In [None]:
mean_ttr_by_genre_df = csal_ttr_df.groupby('Genre')[['Standardized TTR']].mean()
mean_ttr_by_genre_df

In [None]:
mean_ttr_by_genre_df.plot(kind='bar', figsize=(10,5), title='Standardized TTRs Averaged Across Genre of Text')

Let's use the techniques we learned last time, when we produced our gender signal-by-year plots, to see exactly how many works in each genre appear for each author nationality.

In [None]:
csal_genre_by_nationality_df = csal_ttr_df.groupby(['Genre', 'Nationality of Author']).size().unstack(fill_value=0)
csal_genre_by_nationality_df

In [None]:
csal_genre_by_nationality_df.plot(kind='bar', figsize=(10,5), title='Number of Texts by Genre and Nationality of Author')
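
The same table of counts can also be produced with `pd.crosstab()`, which some readers find easier to read than the `groupby().size().unstack()` chain. A sketch on made-up data (hypothetical labels):

```python
import pandas as pd

# Hypothetical toy data, not the real CSAL categories
toy_df = pd.DataFrame({
    "Genre": ["Novel", "Novel", "Poetry", "Novel"],
    "Nationality of Author": ["X", "Y", "X", "X"],
})

via_groupby = (
    toy_df.groupby(["Genre", "Nationality of Author"]).size().unstack(fill_value=0)
)
via_crosstab = pd.crosstab(toy_df["Genre"], toy_df["Nationality of Author"])

print(via_crosstab)
```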

Let's close today's class by:

- imagining how we could improve our approach to our original research question
- thinking of what other research questions we could ask of the CSAL dataset — with the TTR data we've added, or perhaps with some other metadata category or textual metric?