In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from helpers.helper_functions import *

# The Significant Bang Theory

Attention, ADA students!

The Sheldon Cooper we all know and love (OK, some of us might not know him, and some might not love him) from the TV series "The Big Bang Theory" has gotten into an argument with Leonard from the same TV show. Sheldon insists that he knows the show better than anyone, and keeps making various claims about the show, which neither of them knows how to prove or disprove. The two of them have reached out to you ladies and gentlemen, as data scientists, to help them. You will be given the full script of the series, with information on the episode, the scene, the person saying each dialogue line, and the dialogue lines themselves.

Leonard has challenged several of Sheldon's claims about the show, and throughout this exam you will see some of those and you will get to prove or disprove them, but remember: sometimes, we can neither prove a claim, nor disprove it!

## Deadline
Wednesday, January 30th, 2019; 11:15 A.M. (Swiss time)

_For the deadline for extramural exams, see the submission subsection._

## Important notes
* Don't forget to add a textual description of your thought process, the assumptions you made, and your results!
* Please write all your comments in English, and use meaningful variable names in your code.
* As we have seen during the semester, data science is all about multiple iterations on the same dataset. Do not obsess over small details in the beginning, and try to complete as many tasks as possible during the first 2 hours. Then, go back to the obtained results, write meaningful comments, and debug your code if you have found any glaring mistake.
* Fully read the instructions for each question before starting to solve it to avoid misunderstandings, and remember to save your notebook often!
* The exam contains **15 questions organised into 4 tasks**, and is designed for more than 3 hours. **You do not need to solve everything in order to get a 6**, and you have some freedom in choosing the tasks you wish to solve.
* You cannot leave the room in the first and last 15 minutes.
* You can use all the online resources you want except for communication tools (emails, web chats, forums, phone, etc.). We will be monitoring the network for unusual activity.
* Remember, this is not a homework assignment -- no teamwork allowed!

## Submission
* Your file has to be named as "NameSurname_SCIPER.ipynb".
* Make sure you upload your Jupyter Notebook (1 file) to [this](https://goo.gl/forms/7GLvYl94uSOn54jH2) Google form at the end of the exam, with all the cells already evaluated (except for the Spark-related question, Q7). You need to sign in to Google using your EPFL credentials in order to submit the form.
* In case of problems with the form, send your Jupyter Notebook (along with your name and SCIPER number) as a direct message to @ramtin on Mattermost. This is reserved only for those who encounter problems with the submission -- you need to have a reasonable justification for using this back-up.
* You will have until 11:20 (strict deadline) to turn in your submission. **Late submissions will not be accepted.** This deadline is for the students taking the exam at EPFL -- students taking the exam extramurally will have their submission deadline as the starting time of the exam plus 3 hours and 5 minutes.

## Task A: Picking up the shovel (10 points)

**Note: You will use the data you preprocess in this task in all the subsequent ones.**

Our friends' argument concerns the entire show. We have given you a file in the `data/` folder that contains the script of every single episode. New episodes are indicated by '>>', new scenes by '>', and the rest of the lines are dialogue lines. Some lines are said by multiple people (for example, lines indicated by 'All' or 'Together'); **you must discard these lines**, for the sake of simplicity. However, you do not need to do it for Q1 in this task -- you'll take care of it when you solve Q2.

**Q1**. (5 points) Your first task is to extract all lines of dialogue in each scene and episode, creating a dataframe where each row has the episode and scene where a dialogue line was said, the character who said it, and the line itself. You do not need to extract the proper name of the episode (e.g. episode 1 can appear as "Series 01 Episode 01 - Pilot Episode", and doesn't need to appear as "Pilot Episode"). Then, answer the following question: In total, how many scenes are there in each season? We're not asking about unique scenes; the same location appearing in two episodes counts as two scenes. You can use a Pandas dataframe with a season column and a scene count column as the response.

**Note: The data refers to seasons as "series".**

In [2]:
# creating dataframe with episode, scene, character and dialogue
data = pd.DataFrame(columns=['Episode', 'Scene', 'Character', 'Speach'])

In [3]:
# Create the lists that will become the columns of the DataFrame
Episode = []
Scene = []
Character = []
Speach = []


In [4]:
# Initialise episode, scene, character and speech to empty strings.
# Every time I have a row, I add each value to the lists created above.
episode = ""
scene = ""
character = ""
speach = ""
# iterate over lines
with open('data/all_scripts.txt') as f:
    for line in f:
        # extract the episode by split
        if line[:2] == ">>":
            _,episode = line.split(">> ",1)
        # extract scene by split
        if line[:2] == "> ":
            _,scene = line.split("> ",1)
        # get character and speach
        if line[:2] != ">>" and line[:2] != "> ":
            character, speach = line.split(':',1)
        # once I have episode and Scene I append to lists
        if episode != "" and scene != "" and character != "":
            Episode.append(episode.rstrip())
            Scene.append(scene.rstrip())
            Character.append(character.rstrip())
            Speach.append(speach.rstrip())
        

In [5]:
# create dataFrame
data['Episode'] = Episode
data['Scene'] = Scene
data['Character'] = Character
data['Speach'] = Speach

In [6]:
data.head()

Unnamed: 0,Episode,Scene,Character,Speach
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,So if a photon is directed through a plane wi...
1,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,"Agreed, what’s your point?"
2,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,"There’s no point, I just think it’s a good id..."
3,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,Excuse me?
4,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Receptionist,Hang on.


**How many scenes are there in each season?**

In [7]:
# Extract the season which is the Serie number and then group by to get the count
data_season = data.copy()

In [8]:
Serie = []
for i in data['Episode']:
    serie = i.split(" E")
    Serie.append(serie[0])

In [9]:
data_season['Serie'] = Serie

In [10]:
data_season.head()

Unnamed: 0,Episode,Scene,Character,Speach,Serie
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,So if a photon is directed through a plane wi...,Series 01
1,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,"Agreed, what’s your point?",Series 01
2,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,"There’s no point, I just think it’s a good id...",Series 01
3,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,Excuse me?,Series 01
4,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Receptionist,Hang on.,Series 01


In [11]:
# Number of rows (dialogue lines) per season
data_season[['Serie','Scene']].groupby(['Serie']).count()

Serie,Scene
Series 01,4311
Series 02,5492
Series 03,5289
Series 04,5907
Series 05,5125
Series 06,5213
Series 07,5701
Series 08,5620
Series 09,5779
Series 10,5890


**Each series has been extracted on its own, and the table above shows the count for each season. Note that `count()` tallies rows, i.e. dialogue lines, rather than distinct scenes.**
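One way to count scenes directly is to tally the `>` scene markers under each `>>` episode header while reading the raw script. The sketch below is a minimal illustration on a few toy lines, not the reference solution; the helper name `count_scenes_per_season` and the toy dialogue are assumptions made for this example.

```python
from collections import Counter


def count_scenes_per_season(lines):
    """Count scene markers ('> ...') under each season, reading raw script lines."""
    scenes = Counter()
    season = None
    for line in lines:
        if line.startswith(">>"):
            # Episode header, e.g. ">> Series 01 Episode 01 – Pilot Episode"
            episode = line[2:].strip()
            season = episode.split(" Episode")[0]
        elif line.startswith(">"):
            # Scene header: one more scene in the current season
            if season is not None:
                scenes[season] += 1
        # Dialogue lines do not change the scene count
    return dict(scenes)


# Toy script in the same format as the data file (contents are illustrative)
toy_script = [
    ">> Series 01 Episode 01 – Pilot Episode",
    "> A corridor at a sperm bank.",
    "Sheldon: So if a photon is directed through a plane...",
    "Leonard: Agreed, what's your point?",
    "> The stairwell.",
    "Leonard: Excuse me?",
    ">> Series 02 Episode 01 – The Bad Fish Paradigm",
    "> The stairwell.",
    "Penny: Hi.",
]
scene_counts = count_scenes_per_season(toy_script)
print(scene_counts)
```

Because the same location in two episodes counts as two scenes, counting markers avoids the need to deduplicate scene descriptions.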

**Q2**. (5 points) Now, let's define two sets of characters: all the characters, and recurrent characters. Recurrent characters are those who appear in more than one episode. For the subsequent sections, you will need to have a list of recurrent characters. Assume that there are no two _named characters_ (i.e. characters who have actual names and aren't referred to generically as "little girl", "grumpy grandpa", etc.) with the same name, i.e. there are no two Sheldons, etc. Generate a list of recurrent characters who have more than 90 dialogue lines in total, and then take a look at the list you have. If you've done this correctly, you should have a list of 20 names. However, one of these is clearly not a recurrent character. Manually remove that one, and print out your list of recurrent characters. To remove that character, pay attention to the _named character_ assumption we gave you earlier on. **For all the subsequent questions, you must only keep the dialogue lines said by the recurrent characters in your list.**

_Hint: "I know all the recurrent characters because I've watched the entire series five times" is not an acceptable argument, so you need to actually generate the list._

In [12]:
data.head()

Unnamed: 0,Episode,Scene,Character,Speach
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,So if a photon is directed through a plane wi...
1,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,"Agreed, what’s your point?"
2,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,"There’s no point, I just think it’s a good id..."
3,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,Excuse me?
4,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Receptionist,Hang on.


In [13]:
# Group by episode and character 
data_char = data.groupby(['Episode', 'Character'], as_index= False).count()[['Episode', 'Character', 'Scene']].rename(columns= {'Scene':'Count'})

In [14]:
data_char.head()

Unnamed: 0,Episode,Character,Count
0,Series 01 Episode 01 – Pilot Episode,Howard,20
1,Series 01 Episode 01 – Pilot Episode,Leonard,129
2,Series 01 Episode 01 – Pilot Episode,Man,1
3,Series 01 Episode 01 – Pilot Episode,Penny,61
4,Series 01 Episode 01 – Pilot Episode,Raj,3


In [15]:
# Keep only characters with more than 90 dialogue lines in total
total_90 =  data.groupby(['Character'], as_index= False).count()[[ 'Character', 'Scene']].rename(columns= {'Scene':'Count'})

In [16]:
total_90 = total_90[total_90['Count']>90]

In [17]:
total_90.head()

Unnamed: 0,Character,Count
10,Amy,3671
13,Arthur,145
20,Bernadette,2817
21,Bert,102
23,Beverley,171


In [18]:
# Count the number of episodes in which each character speaks
recurrent_char = data_char.groupby(['Character'], as_index= False).count()

In [19]:
# Keep only those who spoke in more than one episode
recurrent_char = recurrent_char[recurrent_char['Episode']>1]

In [20]:
recurrent_char.head()

Unnamed: 0,Character,Episode,Count
2,Adam,3,3
4,Alex,4,4
5,Alfred,2,2
8,All,42,42
10,Amy,155,155


In [21]:
# The intersection will give 20 characters
recurecnt_chars = set(recurrent_char['Character']) & set(total_90['Character'])

In [22]:
len(recurecnt_chars)

20

In [23]:
recurecnt_chars

{'Amy',
 'Arthur',
 'Bernadette',
 'Bert',
 'Beverley',
 'Emily',
 'Howard',
 'Kripke',
 'Leonard',
 'Leslie',
 'Man',
 'Mrs Cooper',
 'Mrs Wolowitz',
 'Penny',
 'Priya',
 'Raj',
 'Sheldon',
 'Stuart',
 'Wil',
 'Zack'}
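The question asks to drop the one entry that is not a named character. A minimal sketch of that manual step, assuming the generic label `Man` is the entry to remove (the names `candidate_chars`, `generic_labels` and `recurrent_named_chars` are introduced here for illustration only):

```python
# The 20 candidates produced by the intersection above
candidate_chars = {
    'Amy', 'Arthur', 'Bernadette', 'Bert', 'Beverley', 'Emily', 'Howard',
    'Kripke', 'Leonard', 'Leslie', 'Man', 'Mrs Cooper', 'Mrs Wolowitz',
    'Penny', 'Priya', 'Raj', 'Sheldon', 'Stuart', 'Wil', 'Zack',
}

# Assumption: 'Man' is a generic label that can refer to different people,
# so it does not satisfy the named-character assumption
generic_labels = {'Man'}

recurrent_named_chars = sorted(candidate_chars - generic_labels)
print(len(recurrent_named_chars), recurrent_named_chars)
```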

## Task B: Read the ~~stats~~ scripts carefully (30 points)

### Part 1: Don't put the shovel down just yet

**Q3**. (2.5 points) From each dialogue line, replace punctuation marks (listed in the EXCLUDE_CHARS variable provided in `helpers/helper_functions.py`) with whitespaces, and lowercase all the text. **Do not remove any stopwords, leave them be for all the questions in this task.**

In [24]:
# Lowercase the speech
data['Speach'] = data['Speach'].str.lower()

In [26]:
# Remove every punctuation mark listed in EXCLUDE_CHARS from the text
def remove_punctuations(text):
    for punctuation in EXCLUDE_CHARS:
        text = text.replace(punctuation, '')
    return text

In [27]:
# apply the function on Speach
data['Speach'] = data['Speach'].apply(remove_punctuations)
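Q3 asks for punctuation to be replaced with whitespace rather than deleted, which keeps words such as "what's" from fusing into one token. A minimal sketch using `str.translate` follows. The real contents of `EXCLUDE_CHARS` are not shown here, so `EXCLUDE_CHARS_EXAMPLE` is a stand-in built from `string.punctuation`, and `punctuation_to_spaces` is a hypothetical helper name.

```python
import string

# Stand-in for the EXCLUDE_CHARS helper variable (assumption for this sketch)
EXCLUDE_CHARS_EXAMPLE = set(string.punctuation)

# Translation table mapping every excluded character to a single space
_TO_SPACE = str.maketrans({ch: " " for ch in EXCLUDE_CHARS_EXAMPLE})


def punctuation_to_spaces(text):
    """Lowercase the text and replace each excluded character with a space."""
    return text.lower().translate(_TO_SPACE)


cleaned = punctuation_to_spaces("Agreed, what's your point?")
print(cleaned.split())
```

On a dataframe, the same function could be passed to `.apply` on the text column.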


**Q4**. (5 points) For each term, calculate its "corpus frequency", i.e. its number of occurrences in the entire series. Visualize the distribution of corpus frequency using a histogram. Explain your observations. What are the appropriate x and y scales for this plot?
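A possible starting point for the corpus frequency is sketched below on a few toy lines (the lines and variable names are illustrative assumptions). It counts every term with `collections.Counter`, then counts how many terms share each frequency, which is the quantity a histogram would display.

```python
from collections import Counter

# Toy cleaned dialogue lines (lowercased, punctuation already handled)
toy_lines = [
    "so if a photon is directed",
    "agreed whats your point",
    "theres no point i just think its a good idea",
    "excuse me",
]

# Corpus frequency: number of occurrences of each term over all lines
corpus_frequency = Counter(word for line in toy_lines for word in line.split())

# Histogram data: how many distinct terms occur exactly k times
frequency_histogram = Counter(corpus_frequency.values())

print(corpus_frequency.most_common(3))
print(sorted(frequency_histogram.items()))
```

Word frequencies are typically heavy-tailed, with many rare terms and a few very common ones, so when plotting on the full corpus it is worth trying logarithmic scales on both axes.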

### Part 2: Talkativity
**Q5**. (2.5 points) For each of the recurrent characters, calculate their total number of words uttered across all episodes. Based on this, who seems to be the most talkative character?

In [28]:
recurecnt_chars

{'Amy',
 'Arthur',
 'Bernadette',
 'Bert',
 'Beverley',
 'Emily',
 'Howard',
 'Kripke',
 'Leonard',
 'Leslie',
 'Man',
 'Mrs Cooper',
 'Mrs Wolowitz',
 'Penny',
 'Priya',
 'Raj',
 'Sheldon',
 'Stuart',
 'Wil',
 'Zack'}

In [29]:
# Keep only the lines said by recurrent characters
talkative = data[data['Character'].isin(recurecnt_chars)].copy()

In [30]:
talkative.head()

Unnamed: 0,Episode,Scene,Character,Speach
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,so if a photon is directed through a plane wi...
1,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,agreed whats your point
2,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,theres no point i just think its a good idea ...
3,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,excuse me
5,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,one across is aegean eight down is nabakov tw...


In [33]:
# Length of each line in characters, used here as a proxy for the number of words
talkative['nb_words'] = talkative['Speach'].str.len()

In [None]:
talkative.head()

In [34]:
# Group by character and sum the number of words, then sort the values to get the most talkative ones
talkative.groupby(['Character'], as_index=False).sum().sort_values(['nb_words'], ascending=False)

Unnamed: 0,Character,nb_words
16,Sheldon,994903
8,Leonard,497404
13,Penny,378594
6,Howard,343245
15,Raj,301280
0,Amy,202823
2,Bernadette,133282
17,Stuart,39689
11,Mrs Cooper,16686
4,Beverley,10799


**Based on the dataframe above, Sheldon seems to be the most talkative character.**
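To count words rather than characters, each line can be split on whitespace before taking its length. A minimal sketch on a toy dataframe, assuming the text has already been cleaned (the toy rows and the name `words_per_character` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Character": ["Sheldon", "Leonard", "Sheldon", "Leonard"],
    "Speach": [
        "so if a photon is directed",
        "agreed whats your point",
        "theres no point",
        "excuse me",
    ],
})

# Number of whitespace-separated tokens in each dialogue line
toy["nb_words"] = toy["Speach"].str.split().str.len()

# Total words per character, most talkative first
words_per_character = (
    toy.groupby("Character")["nb_words"].sum().sort_values(ascending=False)
)
print(words_per_character)
```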

**Q6**. (12.5 points) For each of the recurrent characters, calculate their total number of words uttered per episode (ignoring episodes that the character does not appear in), and calculate a **robust summary statistic** for the word count distribution of each person.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**i)** (2.5 points) What changes do you observe, compared to the analysis in Q5?

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**ii)** (2.5 points) Why is this analysis an improvement over the previous one, and how could you improve it even further? _Hint: The improvement involves making your unit for word counts even more granular - you can go further down than episodes._

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**iii)** (7.5 points) Incorporate that improvement. Do you still see the same results? How **confident** can you be that the "most talkative" person given by this twice improved method is really more talkative than the second most talkative one? _Hint: Read the question again. A good idea would be to use bootstrapping and calculate your summary statistic on each bootstrapped set._

In [36]:
talkative.head()

Unnamed: 0,Episode,Scene,Character,Speach,nb_words
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,so if a photon is directed through a plane wi...,272
1,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,agreed whats your point,24
2,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,theres no point i just think its a good idea ...,60
3,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,excuse me,10
5,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,one across is aegean eight down is nabakov tw...,203


In [37]:
# Get the total word count of each character in each episode
talkative_per_episode = talkative.groupby(['Character', 'Episode'], as_index=False).sum().sort_values(['nb_words'], ascending=False)

In [38]:
talkative_per_episode.head()

Unnamed: 0,Character,Episode,nb_words
1395,Sheldon,Series 02 Episode 13 – The Friendship Algorithm,8951
1423,Sheldon,Series 03 Episode 18 – The Pants Alternative,8501
1391,Sheldon,Series 02 Episode 09 – The White Asparagus Tri...,8167
1375,Sheldon,Series 01 Episode 10 – The Loobenfeld Decay,7808
1440,Sheldon,Series 04 Episode 12 – The Bus Pants Utilization,7807


In [39]:
talkative_per_episode.groupby(['Character'], as_index=False).count().sort_values(['nb_words'], ascending=False)

Unnamed: 0,Character,Episode,nb_words
16,Sheldon,231,231
6,Howard,231,231
8,Leonard,231,231
15,Raj,230,230
13,Penny,229,229
2,Bernadette,161,161
0,Amy,155,155
17,Stuart,62,62
12,Mrs Wolowitz,27,27
10,Man,22,22


**We can see that Sheldon is still at the top, but the ranking has changed a little: in this table, which counts the episodes each character speaks in, Howard is now in second place. So the ordering changed!**

ii)

We can improve our analysis by keeping only the words that carry meaning, for example by removing stop words.
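The hint in part iii) suggests bootstrapping a robust statistic. A minimal sketch of that idea, using the median and the standard library only, is shown below. The per-scene word counts are made-up toy numbers, and `bootstrap_medians` is a hypothetical helper name.

```python
import random
import statistics


def bootstrap_medians(values, n_resamples, rng):
    """Median of each bootstrap resample (sampling with replacement)."""
    n = len(values)
    return [statistics.median(rng.choices(values, k=n)) for _ in range(n_resamples)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Toy word counts per scene for two characters (illustrative values only)
words_per_scene_a = [12, 40, 35, 8, 60, 22, 31, 45, 18, 27]
words_per_scene_b = [10, 15, 30, 5, 25, 20, 12, 33, 9, 14]

medians_a = bootstrap_medians(words_per_scene_a, 1000, rng)
medians_b = bootstrap_medians(words_per_scene_b, 1000, rng)

# Share of bootstrap replicates in which A's median exceeds B's
share_a_above_b = sum(a > b for a, b in zip(medians_a, medians_b)) / len(medians_a)
print(share_a_above_b)
```

A share close to 1 would suggest the ordering of the two characters is stable under resampling, while a share near 0.5 would suggest it is not.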

### Part 3: Obligatory Spark cameo
**Q7**. (7.5 points) Write a Spark script that does the following: Given the raw input file and your list of recurrent characters, create an RDD containing (speaker, dialogue line) rows **only for the recurrent characters** (assume that you already have the list --  no need to calculate it using Spark), and then generate a vectorized bag of words representation for each dialogue line, thus generating an RDD with (speaker, bag of words vector) rows. Then, calculate an aggregated bag of words vector (sum of all vectors) for each person. The final output is therefore an RDD with each of its rows being (speaker, aggregated bag of words vector). For your bag of words vectors, you can use $1\times|V|$ scipy CSR matrices (where $|V|$ is the size of the vocabulary). No filtering of the vocabulary is necessary for this part.

You do not need to run this script, but you do need to use Spark logic and also, the syntax needs to be correct.

## Task C: The Gossip Graph (30 points)

**Note: Only for this task, discard the recurrent characters whose names are not single words, e.g. Mrs. Cooper.**

Let us define _gossip_ as follows: if a dialogue line of character A mentions B by name in a scene that does not involve character B, we say that “A gossips about B” in that line. Multiple mentions of the same person in a single line are counted once, but a character can gossip about several others in the same line. For the sake of simplicity, we only consider gossips where the name of the recurrent character is mentioned as it appears in our list of characters; for example, if someone says "Cooper" and they mean Sheldon, we discard that.
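The definition above can be sketched as follows on two toy scenes. The scene contents and variable names are assumptions made for illustration. A line counts as gossip when it mentions a listed character who does not speak in that scene, and the weight of a pair is the number of scenes in which it happens.

```python
from collections import defaultdict

names = ["Sheldon", "Leonard", "Penny"]

# Toy scenes: (episode, scene) -> list of (speaker, cleaned lowercase line)
toy_scenes = {
    ("E1", "S1"): [
        ("Leonard", "i think sheldon is upset"),
        ("Penny", "sheldon is always upset sheldon"),
    ],
    ("E1", "S2"): [
        ("Sheldon", "leonard and penny are late"),
        ("Leonard", "penny is late not me"),
    ],
}

# (gossiper, target) -> set of scenes in which the gossip happens
gossip_scenes = defaultdict(set)
for scene_key, lines in toy_scenes.items():
    cast = {speaker for speaker, _ in lines}  # characters present in the scene
    for speaker, text in lines:
        tokens = set(text.lower().split())  # a set, so repeated mentions count once
        for name in names:
            if name != speaker and name not in cast and name.lower() in tokens:
                gossip_scenes[(speaker, name)].add(scene_key)

gossip_weight = {pair: len(scenes) for pair, scenes in gossip_scenes.items()}
print(gossip_weight)
```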

**Q8**. (12.5 points) Create the two following graphs first:

1. (5 points) Create the _familiarity graph_, an undirected weighted graph, in which there is a node for each recurrent character, and an edge between two characters if they appear together in at least one scene. The weight of the edge between them is the number of scenes they appear in together. If an edge exists between two people in the familiarity graph, we say that they "know each other".
2. (7.5 points) Create the _gossip graph_, which is a directed weighted graph, in which there is a node for each recurrent character, and a directed edge from the node for A to the node for B if A has gossiped about B at least once. The weight of the edge is the number of scenes in which A has gossiped about B.

_Hint: You can create each graph first as an adjacency matrix and then create a networkx graph out of that._
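As a rough sketch of the familiarity weights, one can collect the set of speakers in each scene and add one to every unordered pair in that set. The toy rows below are illustrative, and a scene is assumed to be identified by its (episode, scene) pair.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy rows: (episode, scene, character), one per dialogue line
toy_rows = [
    ("E1", "S1", "Sheldon"), ("E1", "S1", "Leonard"), ("E1", "S1", "Sheldon"),
    ("E1", "S2", "Leonard"), ("E1", "S2", "Penny"),
    ("E2", "S1", "Sheldon"), ("E2", "S1", "Leonard"), ("E2", "S1", "Penny"),
]

# Characters who speak in each scene
cast_by_scene = defaultdict(set)
for episode, scene, character in toy_rows:
    cast_by_scene[(episode, scene)].add(character)

# Edge weight = number of scenes in which the two characters appear together
familiarity = Counter()
for cast in cast_by_scene.values():
    for a, b in combinations(sorted(cast), 2):
        familiarity[(a, b)] += 1

print(dict(familiarity))
```

The resulting `(a, b, weight)` triples could then be loaded into an undirected networkx graph, for example with `Graph.add_weighted_edges_from`.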

In [40]:
# convert set to list
recurecnt_chars = list(recurecnt_chars)

In [41]:
# Keep only single-word character names
single_names_char = []
for i in recurecnt_chars:
    if len(i.split(' '))==1:
        single_names_char.append(i)

In [42]:
single_names_char

['Man',
 'Raj',
 'Leslie',
 'Kripke',
 'Arthur',
 'Howard',
 'Amy',
 'Penny',
 'Sheldon',
 'Wil',
 'Leonard',
 'Stuart',
 'Bernadette',
 'Emily',
 'Beverley',
 'Bert',
 'Zack',
 'Priya']

1)

In [None]:
# First create the dataframe and then convert it to a graph


Now, answer the following questions:

**Q9**. (5 points) Sheldon claims that every character in the show is familiar with everyone else through at most one intermediary. Based on the familiarity graph, is this true? If not, at most how many intermediaries are needed?
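Sheldon's claim amounts to saying that the longest shortest path in the familiarity graph has length at most 2. A minimal breadth-first-search sketch on a toy graph follows; the adjacency below is made up, and the sketch assumes the graph is connected.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Unweighted shortest path length from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Toy familiarity graph: a chain Sheldon - Leonard - Penny - Zack
toy_adjacency = {
    "Sheldon": {"Leonard"},
    "Leonard": {"Sheldon", "Penny"},
    "Penny": {"Leonard", "Zack"},
    "Zack": {"Penny"},
}

# Longest shortest path over all sources (the diameter of a connected graph)
longest = max(
    max(shortest_path_lengths(toy_adjacency, node).values())
    for node in toy_adjacency
)
max_intermediaries = longest - 1
print(longest, max_intermediaries)
```

With networkx, `nx.diameter` computes the same quantity on a connected graph.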

**Q10**. (5 points) Who is the character through whom the largest number of these indirect familiarities happen? Calculate an appropriate centrality metric on the familiarity graph to answer this question. You can use the package networkx for this section.

**Q11**. (2.5 points) Another claim of Sheldon's is that every recurrent character in the show gossips about all the other recurrent characters. What property of the gossip graph would correspond to this? Does the gossip graph possess that property? If not, then is it the case that for every pair of recurrent characters, at least one gossips about the other? What property would this correspond to?

**Q12**. (5 points) Use the gossip graph and the familiarity graph to figure out if for every pair of recurrent characters, one of them has gossiped about the other if and only if they know each other. Explain your method - the simpler, the better.

## Task D: The Detective's Hat (30 points)

Sheldon claims that given a dialogue line, he can, with an accuracy of above 70%, say whether it's by himself or by someone else. Leonard contests this claim, since he believes that this claimed accuracy is too high. Leonard also suspects that it's easier for Sheldon to distinguish the lines that _aren't_ his, rather than those that _are_. We want you to put on the (proverbial) detective's hat and to investigate this claim.

**Q13**. (7.5 points) Divide the set of all dialogue lines into two subsets: the training set, consisting of all the seasons except the last two, and the test set, consisting of the last two seasons. Each of your data points (which is one row of your matrix) is one **dialogue line**. Now, use the scikit-learn class **TfidfVectorizer** to create TF-IDF representations for the data points in your training and test sets. Note that since you're going to train a machine learning model, everything used in the training needs to be independent of the test set. As a preprocessing step, remove stopwords and words that appear only once from your vocabulary. Use the simple tokenizer provided in `helpers/helper_functions.py` as an input to the TfidfVectorizer class, and use the words provided in `helpers/stopwords.txt` as your stopwords.

In [43]:
data_season.head()

Unnamed: 0,Episode,Scene,Character,Speach,Serie
0,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,So if a photon is directed through a plane wi...,Series 01
1,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,"Agreed, what’s your point?",Series 01
2,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Sheldon,"There’s no point, I just think it’s a good id...",Series 01
3,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Leonard,Excuse me?,Series 01
4,Series 01 Episode 01 – Pilot Episode,A corridor at a sperm bank.,Receptionist,Hang on.,Series 01


In [44]:
# How many series do we have?
data_season.groupby(['Serie']).sum()

Serie,Episode,Scene,Character,Speach
Series 01,Series 01 Episode 01 – Pilot EpisodeSeries 01 ...,A corridor at a sperm bank.A corridor at a spe...,SheldonLeonardSheldonLeonardReceptionistLeonar...,So if a photon is directed through a plane wi...
Series 02,Series 02 Episode 01 – The Bad Fish ParadigmSe...,The Szechuan Palace.The stairwell.The stairwel...,SheldonSheldonLeonardPennyLeonardPennyLeonardP...,No. Don’t call the library. Show me your mucu...
Series 03,Series 03 Episode 01 – The Electric Can Opener...,The North Pole.Opening shows some scenes from ...,SheldonSheldonLeonardHowardRajSheldonSheldonSh...,Three months. This is gonna be great! Three m...
Series 04,Series 04 Episode 01 – The Robotic Manipulatio...,A coffee shop.The apartment. A robotic arm is ...,HowardHowardHowardLeonardRajHowardRajHowardRaj...,"Good God, what have we done? Good God, what h..."
Series 05,Series 05 Episode 01 – The Skank Reflex Analys...,The living room. Leonard is asleep on the couc...,SheldonSheldonSheldonLeonardSheldonLeonardShel...,What does it look like? What does it look lik...
Series 06,Series 06 Episode 01 – The Date Night Variable...,The apartment.The Comic Book Store.The Comic B...,HowardHowardStuartLeonardRajSheldonLeonardStua...,"Oy vay! Oy vay! So, Howard’s really in space,..."
Series 07,Series 07 Episode 01 – The Hofstadter Insuffic...,Penny’s apartment.On the deck of a ship on the...,RajRajLeonardSheldonLeonardSheldonLeonardSheld...,"… but then it turns good again, and that mean..."
Series 08,Series 08 Episode 01 – The Locomotion Interrup...,The apartment.A railway station. Sheldon is we...,AmyAmySheldonManSheldonSheldonLeonardPennyLeon...,How could you let him go? How could you let h...
Series 09,Series 09 Episode 01 – The Matrimonial Momentu...,The apartment.A Wedding Chapel.A Wedding Chape...,SheldonSheldonPennyLeonardPennyLeonardPennyLeo...,"Well, Gollum, you’re an expert on rings. What..."
Series 10,Series 10 Episode 01 – The Conjugal Conjecture...,The apartment.Leonard and Penny’s bedroom.Leon...,PennyPennySheldonLeonardSheldonLeonardSheldonL...,Really? ‘Cause I love it. Really? ‘Cause I lo...


In [89]:
# split to train and test, from 1 to 8 train from 9 to 10 test
serie_num = ['Series 09', 'Series 10']
train = data_season[~data_season['Serie'].isin(serie_num)][['Speach', 'Character']]
test =  data_season[data_season['Serie'].isin(serie_num)][['Speach', 'Character']]

In [90]:
train.head()

Unnamed: 0,Speach,Character
0,So if a photon is directed through a plane wi...,Sheldon
1,"Agreed, what’s your point?",Leonard
2,"There’s no point, I just think it’s a good id...",Sheldon
3,Excuse me?,Leonard
4,Hang on.,Receptionist


In [91]:
# Apply tokenizer to train and test
train['Speach'] =train['Speach'].apply( lambda x: simple_tokeniser(x))
test['Speach'] =test['Speach'].apply( lambda x: simple_tokeniser(x))

In [92]:
# After tokenizing we need to join the tokens back into strings before passing them to the TF-IDF vectorizer
train['Speach'] = train['Speach'].apply( lambda x: ' '.join(x))
test['Speach'] = test['Speach'].apply( lambda x: ' '.join(x))

In [118]:
# Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Fit the vocabulary and IDF weights on the training set only
v = TfidfVectorizer()
X = v.fit_transform(train['Speach'])
# Reuse the fitted vectorizer on the test set so both matrices share the same columns
X_test = v.transform(test['Speach'])


**Q14**. (5 points) Find the set of all words in the training set that are only uttered by Sheldon. Is it possible for Sheldon to identify himself only based on these? Use the test set to assess this possibility, and explain your method.

In [94]:
# Find the set of all words said by Sheldon
sheldon_words = train.copy()

In [95]:
sheldon_words = sheldon_words[sheldon_words['Character'] == "Sheldon"]

In [96]:
sheldon_words.head()

Unnamed: 0,Speach,Character
0,So if a photon is directed through a plane wit...,Sheldon
2,"There’s no point, I just think it’s a good ide...",Sheldon
9,I think this is the place.,Sheldon
13,"Leonard, I don’t think I can do this.",Sheldon
15,No. We are committing genetic fraud. There’s n...,Sheldon


In [105]:
# Get the words said by Sheldon
set_words_sheldon = set()
# Go over Sheldon's lines of speech
for line in sheldon_words['Speach']:
    words = line.split()
    # For every line, get the words and add them to the set
    for word in words:
        set_words_sheldon.add(word)

In [106]:
len(set_words_sheldon)

22144

In [108]:
# Get all the words said by everyone except Sheldon
set_words = set()
# Go over the lines of speech
for line in train[train['Character']!='Sheldon']['Speach']:
    words = line.split()
    # For every line, get the words and add them to the set
    for word in words:
        set_words.add(word)

In [109]:
# get unique words of sheldon
unique_sheldon = set_words_sheldon - set_words

In [110]:
len(unique_sheldon)

10843

In [112]:
list(unique_sheldon)[:10]

['boon',
 'ablutions',
 'wading',
 'Des',
 'snap,',
 'mache',
 'And…',
 'L.H.,',
 'sabbatical,',
 'pubis']

**We can see that some words are unique to Sheldon, so we can identify him by them if they appear in the test set.**

In [113]:
# Get all the words that appear in the test set
set_words_test = set()
# Go over the lines of speech
for line in test['Speach']:
    words = line.split()
    # For every line, get the words and add them to the set
    for word in words:
        set_words_test.add(word)

In [114]:
identify = set_words_test & unique_sheldon

In [116]:
len(identify)

961

**The intersection is not empty! So for the test lines that contain one of these words, we can say that the character is Sheldon.**
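One way to assess this on the test set is to flag a line as Sheldon's whenever it contains at least one Sheldon-only word, then compare the flags with the true speakers. A minimal sketch on toy data (the test lines and the two-word vocabulary are illustrative assumptions):

```python
# Words uttered only by Sheldon in the training set (toy subset)
sheldon_only_words = {"ablutions", "sabbatical"}

# Toy test lines: (true speaker, cleaned line)
toy_test_lines = [
    ("Sheldon", "my morning ablutions are done"),
    ("Leonard", "i need a sabbatical"),
    ("Sheldon", "excuse me"),
    ("Penny", "hang on"),
]

tp = fp = fn = 0
for speaker, text in toy_test_lines:
    # Flag the line if it shares at least one word with the Sheldon-only set
    flagged = bool(sheldon_only_words & set(text.split()))
    if flagged and speaker == "Sheldon":
        tp += 1   # correctly attributed to Sheldon
    elif flagged:
        fp += 1   # someone else used a "Sheldon-only" word
    elif speaker == "Sheldon":
        fn += 1   # Sheldon line without any of the words

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(precision, recall)
```

Low recall would mean that most of Sheldon's lines contain none of these words, and low precision would mean other characters also use them in the test seasons.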

**Q15**. (17.5 points) Now, perform singular value decomposition (SVD) on the training TF-IDF matrix, and calculate a **25-dimensional approximation** for both the training and test TF-IDF matrices (you can do this using scikit-learn's **TruncatedSVD** class). Then, train a logistic regression classifier with 10-fold cross-validation (using the scikit-learn **LogisticRegressionCV** class) on the output of the SVD that given a dialogue line, tells you whether it's by Sheldon or by someone else.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**i)** (7.5 points) Report precision, recall and F1-score for both classes (Sheldon and not-Sheldon), as well as accuracy, of your classifier on the training set and the test set. You need to implement the calculation of the evaluation measures (precision, etc.) yourself -- using the scikit-learn functions for them is not allowed.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**ii)** (5 points) What difference do you observe between the model's scores on the training and test sets? What could you infer from the amount of difference you see? What about the difference between scores on the two classes? Given the performance of your classifier, is Leonard right that the accuracy Sheldon claims is unattainable? What about his suspicions about the lines that Sheldon can and cannot distinguish?
    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**iii)** (2.5 points) List 10 of the most extreme false positives and 10 of the most extreme false negatives, in terms of the probabilities predicted by the logistic regression model. What are common features of false positives? What about the false negatives?
    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**iv)** (2.5 points) What is the most important feature in the model? What are the 5 most important words in this feature? _Hint: Think of the definition of an SVD, and that you did an SVD on the TF-IDF matrix with dialogue lines as rows and words as columns. You have projected the original data points onto a 25-dimensional subspace -- you need to look at the unit vectors you used for the projection._

i)

In [120]:
# Compute precision, recall and F1 score from the confusion-matrix counts
def compute_metrics(TN,FN,FP,TP):
    # compute precision
    precision = TP/(TP+FP)
    # compute recall
    recall = TP/(TP+FN)
    # compute F1_score
    F1_score = 2 * precision * recall / (precision + recall)
    return precision,recall,F1_score

In [122]:
def accuracy(pred, y):
    '''
    Calculates the accuracy by comparing the predictions with given test data.
    '''
    N = len(pred)
    count = 0.0
    for i in range(len(pred)):
        if pred[i] == y[i]:
            count += 1
    return count/N
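The counts that a function such as `compute_metrics` expects can be obtained by comparing predictions with labels, once per class. A minimal sketch on toy labels, where `confusion_counts` is a hypothetical helper that returns its values in the same `(TN, FN, FP, TP)` order:

```python
def confusion_counts(predictions, labels, positive):
    """Return (TN, FN, FP, TP), treating `positive` as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    tn = sum(1 for p, y in pairs if p != positive and y != positive)
    return tn, fn, fp, tp


# Toy labels and predictions (illustrative values only)
toy_labels = ["Sheldon", "Sheldon", "Sheldon", "Other", "Other", "Other", "Other", "Other"]
toy_predictions = ["Sheldon", "Other", "Sheldon", "Other", "Other", "Sheldon", "Other", "Other"]

counts_sheldon = confusion_counts(toy_predictions, toy_labels, positive="Sheldon")
counts_other = confusion_counts(toy_predictions, toy_labels, positive="Other")
print(counts_sheldon, counts_other)
```

Reporting both classes means calling the metric computation twice, once with each class as the positive one.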

In [123]:
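from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegressionCV
# Reduce the training TF-IDF matrix to 25 dimensions
svd = TruncatedSVD(n_components=25)
X_svd = svd.fit_transform(X)
# Binary target: True for Sheldon's lines; 10-fold cross-validation
is_sheldon = (train['Character'] == 'Sheldon').values
clf = LogisticRegressionCV(cv=10).fit(X_svd, is_sheldon)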

ii)

In [121]:
# separate classes 
def get_classes(validation_tab):
    class_adopted = [i[0] for i in validation_tab]
    class_nonadopted = [i[1] for i in validation_tab]
    return class_adopted,class_nonadopted

# this function will return precision,recall, F1_score, accuracy
def get_metrics(class_):
    return [i[0] for i in class_], [i[1] for i in class_], [i[2] for i in class_], [i[3] for i in class_]

iv)

In [None]:
# Important features: rank the inputs of the classifier by the absolute value of their coefficients
feat_importances = pd.Series(np.abs(clf.coef_[0])).sort_values(ascending=False)