# Data Mini: Lab 1.2

For this exercise, I have chosen to analyze discussions of True Detective, another anthology series, whose first season was far more acclaimed than its second. I want to check whether posts about the first season contain more positive adjectives than posts about the second, as that difference in reception would suggest. I will also inspect the nouns.

In [1]:
from collections import Counter
import spacy
import pandas as pd

from tqdm.notebook import tqdm
from scipy.stats import chi2_contingency

nlp = spacy.load("en_core_web_sm")

In [2]:
# Load data discussions.p
df = pd.read_pickle('discussions.p')

In [3]:
# Process the posts
posts = df.post.values
processed_texts = list(tqdm(
    nlp.pipe(posts, n_process=-1, disable=["ner", "parser"]),
    total=len(posts),
))

In [4]:
# Store the processed texts as an attribute of the df
df['processed_texts'] = processed_texts

### 1.3

In [7]:
true_detective_posts = df[df.title == "True Detective"]

In [8]:
# Three seasons of True Detective have been released: in 2014, 2015 and 2019, respectively
true_detective_posts.year.unique()

array([2015, 2014, 2019])

In [9]:
s1 = true_detective_posts[true_detective_posts.year == 2014]
s2 = true_detective_posts[true_detective_posts.year == 2015]

In [48]:
s2_nouns

['faerie',
 'wing',
 'person',
 'prostitute',
 'customer',
 'world',
 'end',
 'deal',
 'deal',
 'sense',
 'staple',
 'trope',
 'excuse',
 'conceal',
 'problem',
 'case',
 'sense',
 'crow',
 'mask',
 'corner',
 'shack',
 'location',
 'cover',
 'cover',
 'season',
 'moment',
 'actress',
 'memory',
 'picture',
 'mayor',
 'end',
 'suit',
 'jacket',
 'gunshot',
 'wound',
 'chest',
 'foreshadowing',
 'father',
 'commune',
 'kind',
 'role',
 'childhood',
 'episode',
 'kid',
 'suicide',
 'prison',
 'detective',
 'father',
 'belief',
 'lifestyle',
 'shit',
 'investigation',
 'cop',
 'fun',
 'setup',
 'agency',
 'mayor',
 'office',
 'plot',
 'shootout',
 'ghetto',
 'drug',
 'idk',
 'season',
 'complaint',
 'incident',
 'news',
 'woman',
 'man',
 'thing',
 'woman',
 'result',
 'training',
 'point',
 'story',
 'man',
 'stomach',
 'power',
 'dude',
 'tooth',
 'way',
 'watch',
 'thing',
 'pawn',
 'shop',
 'shit',
 'politician',
 'police',
 'dick',
 'orgy',
 'invite',
 'time',
 'place',
 'buss',
 'sh

In [47]:
# Processing seasons datasets (getting nouns and adjs)
flatten = lambda t: [item for sublist in t for item in sublist]

s1_nouns = [[word.lemma_.lower() for word in text if word.pos_ == 'NOUN'] for text in s1.processed_texts]
s1_nouns = flatten(s1_nouns)

s1_adjs = [[word.lemma_.lower() for word in text if word.pos_ == 'ADJ'] for text in s1.processed_texts]
s1_adjs = flatten(s1_adjs)

s2_nouns = [[word.lemma_.lower() for word in text if word.pos_ == 'NOUN'] for text in s2.processed_texts]
s2_nouns = flatten(s2_nouns)

s2_adjs = [[word.lemma_.lower() for word in text if word.pos_ == 'ADJ'] for text in s2.processed_texts]
s2_adjs = flatten(s2_adjs)
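
The four extraction steps above differ only in the corpus and the POS tag, so they could be folded into one helper. The following is a minimal sketch, not the code that was run for this lab: `lemmas_by_pos` and the `tok` stand-in are hypothetical names, and the toy tokens only mimic the `lemma_` and `pos_` attributes of spaCy tokens so the example runs without loading a model.

```python
from types import SimpleNamespace


def lemmas_by_pos(docs, pos):
    """Return a flat list of lower-cased lemmas for all tokens tagged `pos`."""
    return [tok.lemma_.lower() for doc in docs for tok in doc if tok.pos_ == pos]


def tok(lemma, pos):
    # Hypothetical stand-in for a spaCy token (only lemma_ and pos_ are mimicked).
    return SimpleNamespace(lemma_=lemma, pos_=pos)


# Toy "documents": each one is just a list of token-like objects.
toy_docs = [
    [tok("Dark", "ADJ"), tok("season", "NOUN")],
    [tok("detective", "NOUN"), tok("bad", "ADJ"), tok("case", "NOUN")],
]

toy_nouns = lemmas_by_pos(toy_docs, "NOUN")
toy_adjs = lemmas_by_pos(toy_docs, "ADJ")
print(toy_nouns)
print(toy_adjs)
```

With real data the call would look like `lemmas_by_pos(s1.processed_texts, 'NOUN')`, since iterating over a spaCy `Doc` yields tokens with the same two attributes.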

### 3. Think of a word that will definitely occur more often in the first subset than in the second (or vice versa), and think of a word that will not differ that much. Compare the frequency of those words using the log-likelihood measure (so you’ll calculate the LLR twice – once for each word).

In [12]:
# get counters for each season
counts_c1 = Counter(s1_adjs)
counts_c2 = Counter(s2_adjs)

In [23]:
# since many people were disappointed with the second season, I will check how distinctive the word 'bad' is
differ_word = 'bad'
freq_c1 = counts_c1[differ_word]
freq_c2 = counts_c2[differ_word]

freq_c1_other = sum(counts_c1.values()) - freq_c1
freq_c2_other = sum(counts_c2.values()) - freq_c2

llr, p_value,_,_ = chi2_contingency([[freq_c1, freq_c2], 
                  [freq_c1_other, freq_c2_other]],
                  lambda_='log-likelihood')
if freq_c2 / freq_c2_other > freq_c1 / freq_c1_other: # adjust sign of llr
    llr = -llr
    
print("Log-likelihood: ", llr)
print('p-value', p_value)

Log-likelihood:  -10.184661652600141
p-value 0.0014161370413705767
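
To make the number above less of a black box, here is a minimal sketch of the log-likelihood (G) statistic computed by hand for a 2x2 table and compared with `chi2_contingency`. The counts in `toy_table` are made up for illustration and are not taken from the True Detective data; `g_statistic` is a hypothetical helper name. One detail worth knowing: for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the value it returns in the cells of this notebook is slightly smaller than the uncorrected G.

```python
import numpy as np
from scipy.stats import chi2_contingency


def g_statistic(table):
    """Uncorrected G = 2 * sum(O * ln(O / E)) for a contingency table."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / observed.sum()
    nonzero = observed > 0  # cells with O = 0 contribute nothing to the sum
    return 2.0 * np.sum(observed[nonzero] * np.log(observed[nonzero] / expected[nonzero]))


# Made-up counts: [[word in corpus 1, word in corpus 2],
#                  [other words in corpus 1, other words in corpus 2]]
toy_table = [[30, 60], [970, 940]]

g_manual = g_statistic(toy_table)
g_scipy, p_value, dof, _ = chi2_contingency(
    toy_table, lambda_="log-likelihood", correction=False
)
g_corrected = chi2_contingency(toy_table, lambda_="log-likelihood")[0]

print("manual G:          ", g_manual)
print("scipy, uncorrected:", g_scipy)
print("scipy, default:    ", g_corrected)
```

The first two printed values should agree, and the third (with the default correction) should be somewhat lower.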


In [24]:
# 'dark' is a word commonly used to describe this show as a whole, so its use should be similar in both seasons
similar_word = 'dark'
freq_c1 = counts_c1[similar_word]
freq_c2 = counts_c2[similar_word]

freq_c1_other = sum(counts_c1.values()) - freq_c1
freq_c2_other = sum(counts_c2.values()) - freq_c2

llr, p_value,_,_ = chi2_contingency([[freq_c1, freq_c2], 
                  [freq_c1_other, freq_c2_other]],
                  lambda_='log-likelihood') 

if freq_c2 / freq_c2_other > freq_c1 / freq_c1_other: # adjust sign of llr
    llr = -llr
    
print("Log-likelihood: ", llr)
print('p-value', p_value)

Log-likelihood:  0.11104046481874308
p-value 0.7389626766715127
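
Since the two cells above repeat the same steps for different words, the computation could be wrapped in a small function. This is a sketch under assumptions: `signed_llr` is a hypothetical name, and the toy counters below are invented so the example is self-contained. The sign is decided by comparing relative frequencies (`freq / total`), which orders the two corpora the same way as the `freq / other` ratio used above but cannot divide by zero when a corpus consists of a single word type.

```python
from collections import Counter

from scipy.stats import chi2_contingency


def signed_llr(word, counts_a, counts_b):
    """Return (llr, p_value); llr is negative when `word` is relatively
    more frequent in counts_b than in counts_a."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    freq_a, freq_b = counts_a[word], counts_b[word]
    table = [[freq_a, freq_b], [total_a - freq_a, total_b - freq_b]]
    llr, p_value, _, _ = chi2_contingency(table, lambda_="log-likelihood")
    if freq_b / total_b > freq_a / total_a:
        llr = -llr
    return llr, p_value


# Invented counts, 100 tokens per corpus.
toy_a = Counter({"bad": 5, "dark": 20, "good": 75})
toy_b = Counter({"bad": 25, "dark": 20, "good": 55})

llr_bad, p_bad = signed_llr("bad", toy_a, toy_b)
llr_dark, p_dark = signed_llr("dark", toy_a, toy_b)
print("bad: ", llr_bad, p_bad)
print("dark:", llr_dark, p_dark)
```

In the toy data 'bad' is five times as frequent in the second counter, so its LLR comes out negative with a small p-value, while 'dark' has identical frequencies and an LLR of zero. Note that `chi2_contingency` raises a `ValueError` for a word that is absent from both counters, because the expected frequencies then contain zeros.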


### 4. Get the most distinctive words of the first subset compared to second subset, and vice versa

In [25]:
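def distinctive_words(target_corpus, reference_corpus):
    """Return a DataFrame with a signed LLR and p-value for every word.

    Both arguments are flat lists of tokens. A positive LLR means the word is
    relatively more frequent in target_corpus, a negative LLR in reference_corpus.
    """
    counts_c1 = Counter(target_corpus) # don't forget to flatten your texts!
    counts_c2 = Counter(reference_corpus)
    vocabulary = set(counts_c1) | set(counts_c2)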
    freq_c1_total = sum(counts_c1.values()) 
    freq_c2_total = sum(counts_c2.values()) 
    results = []
    for word in vocabulary:
        freq_c1 = counts_c1[word]
        freq_c2 = counts_c2[word]
        freq_c1_other = freq_c1_total - freq_c1
        freq_c2_other = freq_c2_total - freq_c2
        llr, p_value,_,_ = chi2_contingency([[freq_c1, freq_c2], 
                      [freq_c1_other, freq_c2_other]],
                      lambda_='log-likelihood') 
        if freq_c2 / freq_c2_other > freq_c1 / freq_c1_other:
            llr = -llr
        result = {'word':word, 
                    'llr':llr,
                    'p_value': p_value}
        results.append(result)
    results_df = pd.DataFrame(results)
    return results_df

In [26]:
results_comp = distinctive_words(s1_adjs, s2_adjs)

In [28]:
results_comp.sort_values('llr', ascending=False)

| | word | llr | p_value |
|---|---|---|---|
| 24 | yellow | 28.715118 | 8.384663e-08 |
| 645 | strange | 6.363820 | 1.164705e-02 |
| 766 | sorry | 5.735741 | 1.662306e-02 |
| 287 | little | 5.052105 | 2.459604e-02 |
| 363 | many | 4.644083 | 3.116078e-02 |
| ... | ... | ... | ... |
| 1 | well | -6.443919 | 1.113327e-02 |
| 400 | weird | -8.296338 | 3.972511e-03 |
| 618 | mexican | -8.629000 | 3.308532e-03 |
| 649 | bad | -10.184662 | 1.416137e-03 |


The most distinctive adjective for the first season is 'yellow', which I assume is connected to the character of 'the Yellow King' in that season.

For the second season, the sorted output above ends with 'bad' (LLR of about -10.18), which makes it the most distinctive adjective on that side, followed by 'mexican' and 'weird'. The prominence of 'bad' is consistent with the difference in the reception of the two seasons.
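
Because a single sorted table mixes both directions and hides the middle rows, it can help to split the results by sign and drop rows that are not significant. The following is a minimal sketch, assuming a results frame with the `word`, `llr` and `p_value` columns returned by `distinctive_words`; `top_distinctive` is a hypothetical helper, and the five toy rows reuse values printed earlier in this notebook.

```python
import pandas as pd


def top_distinctive(results_df, n=5, alpha=0.05):
    """Split significant rows into the n most distinctive words per corpus."""
    significant = results_df[results_df["p_value"] < alpha]
    ranked = significant.sort_values("llr", ascending=False)
    target = ranked[ranked["llr"] > 0].head(n)
    # Reverse the tail so the most negative LLR comes first.
    reference = ranked[ranked["llr"] < 0].tail(n).iloc[::-1]
    return target, reference


# Rows copied from the outputs shown earlier (adjectives, season 1 vs season 2).
toy_results = pd.DataFrame({
    "word": ["yellow", "strange", "dark", "weird", "bad"],
    "llr": [28.715118, 6.363820, 0.111040, -8.296338, -10.184662],
    "p_value": [8.384663e-08, 1.164705e-02, 0.738963, 3.972511e-03, 1.416137e-03],
})

season1_words, season2_words = top_distinctive(toy_results, n=2)
print(season1_words["word"].tolist())
print(season2_words["word"].tolist())
```

The threshold `alpha=0.05` is an arbitrary default for this sketch; 'dark' is filtered out because its p-value is well above it.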

### 5. Get the most distinctive words of the first subset compared to all the posts that are not in the first subset.

In [30]:
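# Reference corpus: posts that are neither about True Detective nor from 2014.
# This is narrower than the full complement of s1, which would be
# df[~((df.title == 'True Detective') & (df.year == 2014))].
subset_not_1 = df[(df.title != 'True Detective') & (df.year != 2014)].processed_texts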
subset_not_1 = [[word.lemma_.lower() for word in text if word.pos_ == 'NOUN'] for text in subset_not_1]
subset_not_1 = flatten(subset_not_1)

In [31]:
results_df_tds1 = distinctive_words(s1_nouns, subset_not_1)

In [32]:
results_df_tds1.sort_values('llr', ascending=False).head(20)

| | word | llr | p_value |
|---|---|---|---|
| 11431 | cult | 218.527800 | 1.894579e-49 |
| 10466 | detective | 130.533913 | 3.131365e-30 |
| 2628 | rust | 96.760337 | 7.824800e-23 |
| 8315 | marty | 87.178438 | 9.915894e-21 |
| 10033 | case | 59.897203 | 9.994338e-15 |
| 2863 | daughter | 54.682291 | 1.416772e-13 |
| 3419 | circle | 53.055547 | 3.242480e-13 |
| 5188 | reggie | 43.580594 | 4.068535e-11 |
| 11427 | lawn | 43.427424 | 4.399786e-11 |
| 2326 | finale | 35.646221 | 2.366064e-09 |
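
How the reference corpus is selected matters for this comparison. As a minimal sketch on a made-up four-row frame (the title "Other Show" and the variable names are hypothetical), the example below contrasts two boolean masks: negating the combined condition gives every post outside the subset, whereas combining two negated conditions with `&` also drops other shows' 2014 posts and True Detective posts from other years.

```python
import pandas as pd

# Made-up frame: two shows, two years each.
toy = pd.DataFrame({
    "title": ["True Detective", "True Detective", "Other Show", "Other Show"],
    "year": [2014, 2015, 2014, 2015],
})

in_subset = (toy.title == "True Detective") & (toy.year == 2014)

# Full complement: every row that is not (True Detective AND 2014).
full_complement = toy[~in_subset]

# Narrower selection: rows that are neither True Detective nor from 2014.
narrow = toy[(toy.title != "True Detective") & (toy.year != 2014)]

print(len(full_complement), "rows in the full complement")
print(len(narrow), "rows in the narrower selection")
```

On this toy frame the full complement keeps three of the four rows, while the narrower selection keeps only the 2015 row of the other show.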


### 6. Get the most distinctive words of the second subset compared to all the posts that are not in the second subset.

In [49]:
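# Reference corpus: posts that are neither about True Detective nor from 2015.
# This is narrower than the full complement of s2, which would be
# df[~((df.title == 'True Detective') & (df.year == 2015))].
subset_not_2 = df[(df.title != 'True Detective') & (df.year != 2015)].processed_texts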
subset_not_2 = [[word.lemma_.lower() for word in text if word.pos_ == 'NOUN'] for text in subset_not_2]
subset_not_2 = flatten(subset_not_2)

In [50]:
results_df_tds2 = distinctive_words(s2_nouns, subset_not_2)

In [51]:
results_df_tds2.sort_values('llr', ascending=False).head(20)

| | word | llr | p_value |
|---|---|---|---|
| 9162 | diamond | 129.478104 | 5.330120e-30 |
| 9978 | detective | 68.456017 | 1.297393e-16 |
| 3252 | noir | 60.266719 | 8.283621e-15 |
| 1116 | mayor | 60.159572 | 8.747047e-15 |
| 10925 | cult | 57.543276 | 3.306243e-14 |
| 3538 | mask | 57.013196 | 4.328975e-14 |
| 5226 | officer | 55.137280 | 1.123987e-13 |
| 8280 | drive | 51.171583 | 8.463501e-13 |
| 11322 | deal | 47.046481 | 6.932303e-12 |
| 10040 | corruption | 45.422511 | 1.587987e-11 |


### 7. Get the most distinctive words of the first and the second subset, compared to all the posts that are neither in the first nor in the second subset.

In [57]:
subset_1_2 = df[(df.title == 'True Detective') & df.year.isin([2014, 2015])].processed_texts
subset_1_2 = [[word.lemma_.lower() for word in text if word.pos_ == 'NOUN'] for text in subset_1_2]
subset_1_2 = flatten(subset_1_2)

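# Reference corpus: posts that are neither about True Detective nor from 2014 or 2015.
# This is narrower than the full complement of the two seasons, which would be
# df[~((df.title == 'True Detective') & df.year.isin([2014, 2015]))].
subset_not_1_2 = df[(df.title != 'True Detective') & ~(df.year.isin([2014, 2015]))].processed_texts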
subset_not_1_2 = [[word.lemma_.lower() for word in text if word.pos_ == 'NOUN'] for text in subset_not_1_2]
subset_not_1_2 = flatten(subset_not_1_2)

results_df_both = distinctive_words(subset_1_2, subset_not_1_2)

In [59]:
results_df_both.sort_values('llr', ascending=False).head(20)

| | word | llr | p_value |
|---|---|---|---|
| 10952 | cult | 235.450754 | 3.860851e-53 |
| 10010 | detective | 172.058586 | 2.627642e-39 |
| 9191 | diamond | 109.190289 | 1.474307e-25 |
| 3546 | mask | 80.368808 | 3.106607e-19 |
| 2521 | rust | 80.091231 | 3.575156e-19 |
| 3270 | circle | 67.151803 | 2.513853e-16 |
| 3259 | noir | 56.997630 | 4.363374e-14 |
| 7965 | marty | 56.913400 | 4.554321e-14 |
| 5243 | officer | 46.994896 | 7.117182e-12 |
| 1119 | mayor | 46.994896 | 7.117182e-12 |
