# INFO 2950 Homework 7

In [1]:
import numpy as np
import pandas as pd

## Part 1: Joint and Conditional Probabilities, Bayes' Rule

Let's start with a dataset describing El Clásico Primera División soccer matches between Real Madrid and FC Barcelona (data source: https://en.wikipedia.org/wiki/List_of_El_Cl%C3%A1sico_matches). The data are represented by a two dimensional grid where indices (indexed starting at 0) correspond to the number of goals scored by Real Madrid (rows) and Barcelona (columns). 

The number of matches where Real Madrid scored **i** goals and FC Barcelona scored **j** goals can be found in the cell `soccer_data[i][j]`.  For example, `soccer_data[3, 2]` indicates the number of matches where Real Madrid scored 3 goals and FC Barcelona scored 2 goals; there are 7 matches where this was the final score.

We consider an array of size `(9, 9)`, implying that the range of goals scored by either Real Madrid or Barcelona (in any match) is between 0 and 8.

In [2]:
soccer_data = np.array([[9, 14, 9, 8, 3, 5, 0, 0, 0], [8, 16, 13, 9, 0, 2, 0, 0, 0], [11, 21, 7, 5, 2, 0, 1, 1, 0], [7, 6, 7, 2, 1, 0, 0, 0, 0], [3, 4, 1, 1, 0, 0, 0, 0, 0], [2, 3, 0, 1, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0, 0]])

In [3]:
print(soccer_data.shape)
print(soccer_data)

(9, 9)
[[ 9 14  9  8  3  5  0  0  0]
 [ 8 16 13  9  0  2  0  0  0]
 [11 21  7  5  2  0  1  1  0]
 [ 7  6  7  2  1  0  0  0  0]
 [ 3  4  1  1  0  0  0  0  0]
 [ 2  3  0  1  0  1  0  0  0]
 [ 0  1  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0]
 [ 0  0  1  0  0  0  0  0  0]]


Let's assume that random variables `R` and `B` denote the number of goals scored by Real Madrid and FC Barcelona, respectively. Remember that `R` and `B` can each take any integer value between 0 and 8.

We can think about `El Clásico` statistically by reframing "matches" as "trials". Then, we let variable `N` represent the total number of trials (in which we sample both of variables `R` and `B`).

## Problem 0 (5 Points)

Use numpy to compute `N` from `soccer_data`:

In [4]:
# your code here
N = np.sum(soccer_data)

## Problem 1 (10 Points)

The probability that `R` will take the value `i` and `B` will take the value `j` is denoted as `Pr(R = i, B = j)` and is called the joint probability of `R = i` and `B = j`.

Compute the joint probablity `Pr(R = 4, B = 1)` where Real Madrid scores 4 goals and FC Barcelona scores 1 goal.

In [5]:
# your code here
print((soccer_data[4,1])/N) 

0.021621621621621623


## Problem 2 (10 Points)

`Pr(R = i)` can be derived by marginalizing the random variable from the joint probabilty computed in the previous problem.

Compute the probabilty `Pr(R = 4)`. (You may code this either by calculating the probability directly, or though marginalizing the random variable `B` from joint probabilities).

In [6]:
# your code here
print(np.sum(soccer_data[4])/N) 

0.04864864864864865


Now, calculate the conditional probability of FC Barcelona scoring 4 goals given Real Madrid scores 2 goals in that match. Save this value as a variable called `prob_b4_r2` and print the value.  Recall that $$\Pr \left[ A | B \right] = \frac{\Pr \left[ A, B \right]}{\Pr \left[ B \right]}$$

(Notation hint: the conditional probability of `B = j` given `R = i` is expressed as `Pr(B = j | R = i)`.)

In [7]:
# your code here
prob_b4_r2 = (np.sum(soccer_data[2,4])) / (np.sum(soccer_data[2])) 
print(prob_b4_r2 ) 

0.041666666666666664


## Problem 3 (10 Points)

Compute the conditional probabilty of Real Madrid scoring 2 goals given that FC Barcelona scores exactly 4 goals in that match by applying Bayes' Rule using the following three steps:

First, make a variable called `prob_r2` that calculates the marginal probability that R = 2.

Second, make a variable called `prob_b4` that calculates the marginal probability that B = 4.

Then, make a variable called `cond_prob` that uses `prob_b4_r2` (from Problem 2), `prob_r2`, and `prob_b4` to calculate the conditional (using Bayes' rule). 

Finally, print `cond_prob`.

#### Bayes' rule

Bayes' rule relates the conditional probability $\Pr \left[ A | B \right]$ to the conditional probability $\Pr \left[ B | A \right]$:

$$\Pr \left[ A | B \right] = \frac{\Pr \left[ B | A \right] \Pr \left[ A \right] }{\Pr \left[ B \right]}$$

While this may initially seem mysterious, we can actually derive Bayes' rule by applying two definitions:

1. We start with the definition of the probability of $A$ conditioned on $B$:

$$\Pr \left[ A | B \right] = \frac{\Pr \left[ A, B \right]}{\Pr \left[ B \right]}$$

2. Now, we can plug in the definition of the joint probability $\Pr \left[ A, B \right] = \Pr \left[ B | A \right] \Pr \left[ A \right]$ in the numerator on the right hand side:

$$\Pr \left[ A | B \right] = \frac{\Pr \left[ B | A \right] \Pr \left[ A \right] }{\Pr \left[ B \right]}$$

And we have just derived Bayes' rule!

In [8]:
# your code here

prob_r2 = (np.sum(soccer_data[2])) 
prob_b4 = (np.sum(soccer_data[:,4]))

cond_prop = (prob_b4_r2)*(prob_r2 )/prob_b4
print(cond_prop)


0.3333333333333333


## Problem 4 (15 Points)
Let's say you have a friend with whom you regularly bet against. This friend only ever uses one of four coins (all of the coins have two sides: either heads or tails). Three of these coins are perfectly balanced, landing on heads with 50% probability and tails with 50% probability. The fourth coin is your friend's "lucky" coin, which lands on heads 75% of the time (and on tails 25% of the time). Aside from these probabilities, the four coins are perfectly identical. You know your friend has an equal chance of using any of the four coins, but you don't know which one they are using today (perhaps you should reconsider this friendship).

Your friend flips one of these coins and it lands on heads. What is the probability that it is your friend's "lucky" coin?

**Use markdown to write an explanation of your solution. Use the Python cell below to type your arithmetic calculations and output your answer.**

**Answer here**

we can use bayes' theorem to solve this problem. 
pr(lucky|heads) = pr(heads|lucky)pr(lucky)/pr(heads) 

pr(heads|lucky) = probability of heads given the lucky coin = .75 

pr(lucky) = probability of using the lucky coin = .25 

pr(heads) = probability of getting heads overall =  pr(heads|lucky) * pr(lucky) + pr(heads|notlucky) * pr(notlucky) = .75*.25 + .5*.75


we plug these values into our formula like this: 
((.25*.75)/ (.75*.25 + .5*.75)) = 1/3 

1/3 is our answer

In [9]:
# your code here

print((.25*.75)/ (.75*.25 + .5*.75)) 

0.3333333333333333


## Part 2: Log probability and sequences of events

## Problem 5 (10 Points)

In class we calculated the probability of ghost or pumpkin from two "urns", and used those numbers to guess which urn was more likely to have produced that sequence of emoji.
In this problem you will do the same thing but with a sequence of letters. Instead of urns, you will compare the probability of each sequence of letters to the probability of each letter for several European languages.

Start by loading letter frequency data from the file `letter_frequency.csv`. This data is from [Wikipedia](https://en.wikipedia.org/wiki/Letter_frequency#Relative_frequencies_of_letters_in_other_languages), collected by Adrianus Kleemans in a data file at [this Github repo](https://github.com/akleemans/letter-frequency).

Look at the contents of the file `letter_frequency.csv` in a text editor, or through Jupyter. Describe one fact that is unusual about this CSV file.

In [10]:
pd.read_csv("letter_frequency.csv")

Unnamed: 0,Letter;French;German;Spanish;Portuguese;Esperanto;Italian;Turkish;Swedish;Polish;Dutch;Danish;Icelandic;Finnish;Czech
0,a;7.636;6.516;11.525;14.634;12.117;11.745;12.9...
1,b;0.901;1.886;2.215;1.043;0.980;0.927;2.844;1....
2,c;3.260;2.732;4.019;3.882;0.776;4.501;1.463;1....
3,d;3.669;5.076;5.010;4.992;3.044;3.736;5.206;4....
4,e;14.715;16.396;12.181;12.570;8.995;11.792;9.9...
...,...
77,ŭ;0;0;0;0;0.520;0;0;0;0;0;0;0;0;0
78,ů;0;0;0;0;0;0;0;0;0;0;0;0;0;0.204
79,ź;0;0;0;0;0;0;0;0;0.078;0;0;0;0;0
80,ż;0;0;0;0;0;0;0;0;0.706;0;0;0;0;0


**Answer here**
The semicolons are something I found unusual, and its not really in a standard dataframe format. 


## Problem 6 (10 Points)

Use the function `pandas.read_csv` to load the letter frequencies file. Consult the documentation for this function to specify the correct delimiter and to use the field `Letter` as the index column. Save the output in a variable `letter_data`.

The numeric values in the file are percentages, but we want probabilities. Multiply the `letter_data` data frame by `0.01`. Use the function `head` to display the first five rows of the data frame, to confirm that you separated the fields correctly. The value for `a` in `French` should be 0.07636.

In [11]:
# your code here
letter_data = pd.read_csv("letter_frequency.csv", delimiter=";", index_col="Letter")
letter_data = letter_data*0.01
letter_data.head()

Unnamed: 0_level_0,French,German,Spanish,Portuguese,Esperanto,Italian,Turkish,Swedish,Polish,Dutch,Danish,Icelandic,Finnish,Czech
Letter,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
a,0.07636,0.06516,0.11525,0.14634,0.12117,0.11745,0.1292,0.09383,0.10503,0.07486,0.06025,0.1011,0.12217,0.08421
b,0.00901,0.01886,0.02215,0.01043,0.0098,0.00927,0.02844,0.01535,0.0174,0.01584,0.02,0.01043,0.00281,0.00822
c,0.0326,0.02732,0.04019,0.03882,0.00776,0.04501,0.01463,0.01486,0.03895,0.01242,0.00565,0.0,0.00281,0.0074
d,0.03669,0.05076,0.0501,0.04992,0.03044,0.03736,0.05206,0.04702,0.03725,0.05933,0.05858,0.01575,0.01043,0.03475
e,0.14715,0.16396,0.12181,0.1257,0.08995,0.11792,0.09912,0.10149,0.07352,0.17324,0.15453,0.06418,0.07968,0.07562


## Problem 7 (10 Points)

Create a variable `polish` and set it equal to the column for `"Polish"` from the data frame. Print the first 26 values of this series. Which English letters are not used in Polish?

Use the `.loc` accessor to get the row in the original data set for `"c"`. Print this row. Then print the log of the row (use `np.log()`). 

Which language has the highest probability of *c*? 
How can you tell from the log probabilities which language has the highest probability?
Which language does not use the letter *c*? What happens to the log probability for that language?

In [12]:
# your code here
polish = letter_data["Polish"]
print(polish.head(n=26) ) 

crow = letter_data.loc["c"]
print(crow) 

print(np.log(crow)) 

Letter
a    0.10503
b    0.01740
c    0.03895
d    0.03725
e    0.07352
f    0.00143
g    0.01731
h    0.01015
i    0.08328
j    0.01836
k    0.02753
l    0.02564
m    0.02515
n    0.06237
o    0.06667
p    0.02445
q    0.00000
r    0.05243
s    0.05224
t    0.02475
u    0.02062
v    0.00012
w    0.05813
x    0.00004
y    0.03206
z    0.04852
Name: Polish, dtype: float64
French        0.03260
German        0.02732
Spanish       0.04019
Portuguese    0.03882
Esperanto     0.00776
Italian       0.04501
Turkish       0.01463
Swedish       0.01486
Polish        0.03895
Dutch         0.01242
Danish        0.00565
Icelandic     0.00000
Finnish       0.00281
Czech         0.00740
Name: c, dtype: float64
French       -3.423443
German       -3.600136
Spanish      -3.214137
Portuguese   -3.248820
Esperanto    -4.858773
Italian      -3.100871
Turkish      -4.224681
Swedish      -4.209082
Polish       -3.245477
Dutch        -4.388447
Danish       -5.176100
Icelandic         -inf
Finnish      -5.87

  result = getattr(ufunc, method)(*inputs, **kwargs)


**Answer here**

q, x, and v do not seem to be used in polish.

Italian has the highest probability and from log probabilities I can tell because Italian has the highest log probability (they're all negative and it is the closest to 0) Icelandic does not use the letter C and for this, the log probability is negative infinity. 


## Problem 8 (10 Points)

Create a function `get_scores` that takes one argument, a string `s`.

In this function:

* Set `s` equal to `s` but lower case.
* Use the function `np.zeros` to create a variable `language_scores` that allocates an array of 14 zeros (one for each language).
* Write a `for` loop that iterates over each letter in `s`. Remember that a Python string is an array of letters! If the letter is in the `index` for `letter_data`, add the log of the row for the letter to `language_scores` (remember in numpy you can add the entire array with one operation).
* Return `language_scores`.

Use the `get_scores` function to evaluate the log probability of the string `"abc"`.  The log function may produce errors, you can ignore these. The value for French should be -10.705159 and `-inf` for Icelandic.

Which language is most likely to have produced this string? Why is Icelandic negative infinity?

In [13]:
# your code here

def get_scores(s): 
    language_scores = np.zeros(14) 
    for i in s: 
        if i in letter_data.index: 
            language_scores+= (np.log(letter_data.loc[i])) 
    return language_scores

get_scores("abc")

French       -10.705159
German       -10.301758
Spanish       -9.184706
Portuguese    -9.733711
Esperanto    -11.594707
Italian       -9.923585
Turkish       -9.831033
Swedish      -10.751993
Polish        -9.550271
Dutch        -11.125800
Danish       -11.897375
Icelandic          -inf
Finnish      -13.851483
Czech        -12.181902
Name: a, dtype: float64

**Answer here**

the log probability for spanish is the closet to 0 so it is the most likely to have produced this string. Icelandic is negative infinity because the string contains the letter c, which does not occur in icelandic, so icelandic could not have produced this string (according to our previous coding work and knowledge). 

## Problem 9 (10 Points)

We have selected several short passages from Wikipedias in different languages. These languages are (in alphabetical order---not necessarily the order of the questions below) Dutch, Finnish, German, Icelandic, Italian, Polish, and Portuguese. Attempt to identify each language by computing log-likelihood of each observation under each language model. Compare your guess with your own knowledge OR the result you get by using Google Translate to auto-detect the language of each passage. If your guess doesn't agree with the auto-detected language, comment on why that might be (for example, by specifying any letters or phrases that might be particularly identifying of a language).

**(9.1)** `Shorttrack is een schaatsrace op een ijshockeybaan. In tegenstelling tot het langebaanschaatsen is de tijd van een rijder niet van belang: vier tot zes rijders starten tegelijk en wie als eerste over de finish komt, heeft gewonnen.`

In [14]:
# your code here
s = "Shorttrack is een schaatsrace op een ijshockeybaan. In tegenstelling tot het langebaanschaatsen is de tijd van een rijder niet van belang: vier tot zes rijders starten tegelijk en wie als eerste over de finish komt, heeft gewonnen."
get_scores(s)

French       -575.739088
German       -547.545694
Spanish      -585.844154
Portuguese   -600.353705
Esperanto           -inf
Italian      -603.701997
Turkish             -inf
Swedish      -552.569848
Polish       -607.324313
Dutch        -532.972669
Danish       -554.421066
Icelandic           -inf
Finnish      -583.866066
Czech        -614.471165
Name: h, dtype: float64

**Answer here**

Dutch        -532.972669 is the highest log-likelihood. This is the correct answer. 


**(9.2)** `Per poetica e pensiero di Alessandro Manzoni si intendono le convinzioni poetiche, stilistiche, linguistiche ed ideologiche che hanno delineato la parabola esistenziale e letteraria di Manzoni dagli esordi giacobini e neoclassici fino alla morte.`

In [15]:
# your code here
s = 'Per poetica e pensiero di Alessandro Manzoni si intendono le convinzioni poetiche, stilistiche, linguistiche ed ideologiche che hanno delineato la parabola esistenziale e letteraria di Manzoni dagli esordi giacobini e neoclassici fino alla morte.'
get_scores(s)

French       -588.514487
German       -594.089831
Spanish      -582.306212
Portuguese   -595.651396
Esperanto    -597.590241
Italian      -563.438855
Turkish      -609.347687
Swedish      -606.164488
Polish       -617.853315
Dutch        -581.886110
Danish       -616.670512
Icelandic           -inf
Finnish      -632.107166
Czech        -645.158867
Name: e, dtype: float64

**Answer here**
Italian      -563.438855 is the highest log likelihood. This is the correct answer. 


**(9.3)** `Kiinalaisen Wuhanin kaupungin hallinto määräsi koronaviruksen oireita osoittavat henkilöt erityiselle karanteenivyöhykkeelle hallintopakon uhalla.`

In [16]:
# your code here
s = 'Kiinalaisen Wuhanin kaupungin hallinto määräsi koronaviruksen oireita osoittavat henkilöt erityiselle karanteenivyöhykkeelle hallintopakon uhalla'
get_scores(s)

French              -inf
German       -412.444311
Spanish             -inf
Portuguese          -inf
Esperanto           -inf
Italian             -inf
Turkish             -inf
Swedish      -386.896675
Polish              -inf
Dutch               -inf
Danish              -inf
Icelandic           -inf
Finnish      -368.933277
Czech               -inf
Name: i, dtype: float64

**Answer here**

Finnish      -368.933277 is the highest log-likelihood, this is the correct answer. 

**(9.4)** `Útganga Breta úr Evrópusambandinu eða í daglegu tali Brexit (sambland af ensku orðunum British „breskur“ og exit „útganga“) var úrsögn Bretlands úr Evrópusambandinu (ESB).`

In [17]:
# your code here
s = 'Útganga Breta úr Evrópusambandinu eða í daglegu tali Brexit (sambland af ensku orðunum British „breskur“ og exit „útganga“) var úrsögn Bretlands úr Evrópusambandinu (ESB).'
get_scores(s)

French              -inf
German              -inf
Spanish             -inf
Portuguese          -inf
Esperanto           -inf
Italian             -inf
Turkish             -inf
Swedish             -inf
Polish              -inf
Dutch               -inf
Danish              -inf
Icelandic    -406.460008
Finnish             -inf
Czech               -inf
Name: t, dtype: float64

**Answer here**

Icelandic has the highest log likelihood and this is the correct answer. 


**(9.5)** `Ihre Bausteine sind vier verschiedene Nukleotide, die jeweils aus einem Phosphatrest, dem Zucker Desoxyribose und einer von vier organischen Basen bestehen.`

In [18]:
# your code here
s = 'Ihre Bausteine sind vier verschiedene Nukleotide, die jeweils aus einem Phosphatrest, dem Zucker Desoxyribose und einer von vier organischen Basen bestehen.'
get_scores(s)

French       -371.322146
German       -360.742677
Spanish      -381.337360
Portuguese   -388.961882
Esperanto           -inf
Italian      -393.382415
Turkish             -inf
Swedish      -370.749294
Polish       -412.096882
Dutch        -358.896398
Danish       -368.541993
Icelandic           -inf
Finnish      -388.617096
Czech        -400.647911
Name: h, dtype: float64

**Answer here**

German       -360.742677 has the highest log likelihood and this is the correct answer. 

**(9.6)** `Facebook, a maior mídia social e rede social virtual do mundo, é fundada por Mark Zuckerberg e seus colegas Eduardo Saverin, Andrew McCollum, Dustin Moskovitz e Chris Hughes, alunos da Universidade de Harvard.`

In [19]:
# your code here
s = 'Facebook, a maior mídia social e rede social virtual do mundo, é fundada por Mark Zuckerberg e seus colegas Eduardo Saverin, Andrew McCollum, Dustin Moskovitz e Chris Hughes, alunos da Universidade de Harvard'
get_scores(s)

French              -inf
German              -inf
Spanish      -483.625323
Portuguese   -483.514258
Esperanto           -inf
Italian             -inf
Turkish             -inf
Swedish             -inf
Polish              -inf
Dutch               -inf
Danish              -inf
Icelandic           -inf
Finnish             -inf
Czech        -510.976585
Name: a, dtype: float64

**Answer here**
Portuguese   -483.514258 has the highest log-likelihood and this is the correct answer

**(9.7)** `Od początku zawodowej kariery zawodnik Los Angeles Lakers. Razem z Shaquille’em O’Nealem poprowadził zespół do trzech mistrzostw z rzędu w latach 2000–2002. Po odejściu z Lakers O’Neala, Bryant został główną postacią klubu.`

In [20]:
# your code here
s = 'Od początku zawodowej kariery zawodnik Los Angeles Lakers. Razem z Shaquille’em O’Nealem poprowadził zespół do trzech mistrzostw z rzędu w latach 2000–2002. Po odejściu z Lakers O’Neala, Bryant został główną postacią klubu.'
get_scores(s)

French       -inf
German       -inf
Spanish      -inf
Portuguese   -inf
Esperanto    -inf
Italian      -inf
Turkish      -inf
Swedish      -inf
Polish       -inf
Dutch        -inf
Danish       -inf
Icelandic    -inf
Finnish      -inf
Czech        -inf
Name: d, dtype: float64

**Answer here**

the output returns `-inf` for each language so we are not really able to make a guess. the correct answer is polish, and the reason why our code ruled out polish is because the sentence contains the letter "q" in someone's name, which of course is allowed in the polish language but our code/algorithm has not made an allowance for this: q can be used in polish in names and quotes, but it is not a naturally occuring letter. Our code doesn't make this key distinction