## Probability and Coding

#### 1. Conditional Probability and Independence

1. **Probability** 

    $\displaystyle \LARGE \Pr(A)\quad \textrm{or} \quad\Pr(X=x)$<br><br>
    
2. **Conditional Probability** 

    $\displaystyle \Huge \Pr(\;A\,|\,B\;)\quad$ or $\quad\Pr(\; Y=y\,|\,X=x\;)$<br>
    
    A chatbot can be specified as a conditional distribution over the next word given some context, for example:

    1. **Markov**: $\Pr(\; W_{i+1}=w_{i+1}\,|\,W_i=w_i\;)$  
    2. **Bigram**: $\Pr(\; W_{i+2}=w_{i+2}\,|\, W_{i+1}=w_{i+1}, W_i=w_i\;)$  
    3. **Trigram**: $\Pr(\; W_{i+3}=w_{i+3} \,|\, W_{i+2}=w_{i+2}, W_{i+1}=w_{i+1}, W_i=w_i\;)$ 
    4. **Context**: $\Pr(\; W_{i+2}=w_{i+2} \,|\, W_{i+1}=w_{i+1}, W_i=w_i, C=c\;)$<br><br>

3. **Independence** 

    $\displaystyle \Huge \Pr(A)=\Pr(\;A\,|\,B\;)\quad$ or $\quad\Pr(Y=y) = \Pr(\; Y=y\,|\,X=x\;)$
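As a small numeric check of the definitions above, the following sketch computes $\Pr(A)$ and $\Pr(A\,|\,B)$ from a table of joint counts and tests whether the two agree. The counts are invented for illustration and are not taken from any data set used in these notes.

```python
# Hypothetical joint counts for two binary events A and B (made-up numbers).
counts = {
    (True, True): 20,    # A and B
    (True, False): 30,   # A and not B
    (False, True): 20,   # not A and B
    (False, False): 30,  # not A and not B
}
total = sum(counts.values())

# Marginal probability Pr(A): add up every cell where A is True.
pr_A = sum(n for (a, b), n in counts.items() if a) / total

# Conditional probability Pr(A | B) = Pr(A and B) / Pr(B).
pr_B = sum(n for (a, b), n in counts.items() if b) / total
pr_A_and_B = counts[(True, True)] / total
pr_A_given_B = pr_A_and_B / pr_B

print(pr_A, pr_A_given_B)

# A and B are independent exactly when Pr(A) equals Pr(A | B).
independent = abs(pr_A - pr_A_given_B) < 1e-12
print(independent)
```

With these particular counts the two probabilities coincide, so knowing that B occurred does not change the probability of A.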

#### 2. Multinomial distributions

1. `from scipy import stats`
2. `stats.multinomial(p=probabilities, n=trials).rvs(size=samples)`
3. `import numpy as np` and `np.array()`
4. `np.random.seed(initialization)` and `np.random.choice(options, size=draws, replace=True, p=None)`
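The calls listed above can be compared side by side. This is a minimal sketch: the probabilities and the labels in `options` are placeholders chosen for illustration, and `random_state=0` is passed only to make the draws repeatable.

```python
import numpy as np
from scipy import stats

p = [0.7, 0.1, 0.2]            # category probabilities (must sum to 1)
options = ["a", "b", "c"]      # hypothetical category labels

# n=1 trial per sample: each row is a one-hot vector marking the chosen category.
one_hot = stats.multinomial(n=1, p=p).rvs(size=5, random_state=0)
print(one_hot.shape)           # (5, 3)

# n=10 trials per sample: each row holds category counts that sum to 10.
counts = stats.multinomial(n=10, p=p).rvs(size=5, random_state=0)
print(counts.sum(axis=1))

# np.random.choice returns the labels themselves rather than one-hot rows.
np.random.seed(0)
draws = np.random.choice(options, size=5, replace=True, p=p)
print(draws)
```

Note that `n` is the number of trials per sample and `size` is the number of samples; the number of categories is determined by the length of `p`.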

#### 3. Python string manipulation for a Markovian ChatBot

- `avatar.dtypes` and `df.col.str.upper()`
    - `.replace` and `import re` "regular expressions" ("regexp") are demonstrated but will not be tested 
- **Operator overloading** `+` and `.sum().split(' ')`
- `for i in range(n)` and `for x in lst` and `for i,x in enumerate(lst)`
- `list()` and `dict()`
- `if`/`else`


In [3]:
import pandas as pd 
url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
# the github.com "blob" page URL does not work with read_csv (it serves an HTML page, not the raw CSV):
# https://github.com/KeithGalli/pandas/blob/master/pokemon_data.csv
pokeaman = pd.read_csv(url)
pokeaman

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [6]:
pokeaman.sort_values?

In [7]:
pokeaman.sort_values("Attack", ascending=False)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
163,150,MewtwoMega Mewtwo X,Psychic,Fighting,106,190,100,154,100,130,1,True
232,214,HeracrossMega Heracross,Bug,Fighting,80,185,115,40,105,75,2,False
424,383,GroudonPrimal Groudon,Ground,Fire,100,180,160,150,90,90,3,True
426,384,RayquazaMega Rayquaza,Dragon,Flying,105,180,100,180,100,115,3,True
429,386,DeoxysAttack Forme,Psychic,,50,180,20,180,20,150,3,True
...,...,...,...,...,...,...,...,...,...,...,...,...
139,129,Magikarp,Water,,20,10,55,15,20,80,1,False
261,242,Blissey,Normal,,255,10,10,75,135,55,2,False
230,213,Shuckle,Bug,Rock,20,10,230,10,230,5,2,False
121,113,Chansey,Normal,,250,5,5,35,105,50,1,False


In [10]:
pokeaman[pokeaman['Legendary']]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
156,144,Articuno,Ice,Flying,90,85,100,95,125,85,1,True
157,145,Zapdos,Electric,Flying,90,90,85,125,90,100,1,True
158,146,Moltres,Fire,Flying,90,100,90,125,85,90,1,True
162,150,Mewtwo,Psychic,,106,110,90,154,90,130,1,True
163,150,MewtwoMega Mewtwo X,Psychic,Fighting,106,190,100,154,100,130,1,True
...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [15]:
pokeaman[(pokeaman['Type 1'] == 'Ghost') & (pokeaman['Type 2'] == 'Ghost')]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary


In [18]:
pokeaman[(pokeaman['Attack'] < 100) & (pokeaman['Defense'] > 100)].sort_values("Name")[6:]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
376,344,Claydol,Ground,Psychic,60,70,105,70,120,75,3,False
98,91,Cloyster,Water,Ice,50,95,180,85,45,70,1,False
699,638,Cobalion,Steel,Fighting,91,90,129,90,72,108,5,True
624,563,Cofagrigus,Ghost,,58,50,145,95,105,30,5,False
546,488,Cresselia,Psychic,,120,70,120,75,130,85,4,False
619,558,Crustle,Bug,Rock,70,95,125,65,75,45,5,False
616,555,DarmanitanZen Mode,Fire,Psychic,105,30,105,140,105,55,5,False
430,386,DeoxysDefense Forme,Psychic,,50,70,160,70,160,90,3,True
502,452,Drapion,Poison,Dark,70,90,110,60,75,95,4,False
389,356,Dusclops,Ghost,,40,70,130,60,130,25,3,False


In [None]:
pokeaman.iloc[:5, :3]  # .iloc[row_positions, column_positions]; these slices are example values

In [20]:
stats.multinomial?

In [22]:
from scipy import stats
stats.multinomial(p=[.7,.1,.2], n=1).rvs(size=1)

# stats.multinomial(p=probabilities, n=trials).rvs(size=samples)

array([[1, 0, 0]])

In [None]:
import numpy as np
np.random.choice?

In [23]:
import pandas as pd
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-08-11/avatar.csv"
avatar = pd.read_csv(url) #avatar.isnull().sum() #avatar[avatar.isnull().sum(axis=1)>0]
avatar[:10]

Unnamed: 0,id,book,book_num,chapter,chapter_num,character,full_text,character_words,writer,director,imdb_rating
0,1,Water,1,The Boy in the Iceberg,1,Katara,Water. Earth. Fire. Air. My grandmother used t...,Water. Earth. Fire. Air. My grandmother used t...,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
1,2,Water,1,The Boy in the Iceberg,1,Scene Description,"As the title card fades, the scene opens onto ...",,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
2,3,Water,1,The Boy in the Iceberg,1,Sokka,It's not getting away from me this time. [Clos...,It's not getting away from me this time. Watc...,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
3,4,Water,1,The Boy in the Iceberg,1,Scene Description,"The shot pans quickly from the boy to Katara, ...",,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
4,5,Water,1,The Boy in the Iceberg,1,Katara,"[Happily surprised.] Sokka, look!","Sokka, look!","‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
5,6,Water,1,The Boy in the Iceberg,1,Sokka,"[Close-up of Sokka; whispering.] Sshh! Katara,...","Sshh! Katara, you're going to scare it away. ...","‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
6,7,Water,1,The Boy in the Iceberg,1,Scene Description,"Behind Sokka, Katara is still making circular ...",,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
7,8,Water,1,The Boy in the Iceberg,1,Katara,[Struggling with the water that passes right i...,"But, Sokka! I caught one!","‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
8,9,Water,1,The Boy in the Iceberg,1,Scene Description,The bubble containing her fish slowly drifts a...,,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
9,10,Water,1,The Boy in the Iceberg,1,Katara,[Exclaims indignantly.] Hey!,Hey!,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1


In [38]:
avatar.dtypes

id                   int64
book                object
book_num             int64
chapter             object
chapter_num          int64
character           object
full_text           object
character_words     object
writer              object
director            object
imdb_rating        float64
dtype: object

In [39]:
print((avatar['character'].str.upper()+": " + avatar['full_text']+"\n\n")[:4].sum())

KATARA: Water. Earth. Fire. Air. My grandmother used to tell me stories about the old days: a time of peace when the Avatar kept balance between the Water Tribes, Earth Kingdom, Fire Nation and Air Nomads. But that all changed when the Fire Nation attacked. Only the Avatar mastered all four elements; only he could stop the ruthless firebenders. But when the world needed him most, he vanished. A hundred years have passed, and the Fire Nation is nearing victory in the war. Two years ago, my father and the men of my tribe journeyed to the Earth Kingdom to help fight against the Fire Nation, leaving me and my brother to look after our tribe. Some people believe that the Avatar was never reborn into the Air Nomads and that the cycle is broken, but I haven't lost hope. I still believe that, somehow, the Avatar will return to save the world.

SCENE DESCRIPTION: As the title card fades, the scene opens onto a shot of an icy sea before panning slowly to the left, revealing more towering iceberg

In [40]:
avatar.character.value_counts()#[:10]

character
Scene Description    3393
Aang                 1796
Sokka                1639
Katara               1437
Zuko                  776
                     ... 
The Hippo               1
Audience                1
Young Mai               1
Old woman               1
Katara and Sokka        1
Name: count, Length: 374, dtype: int64

In [41]:
avatar.chapter.value_counts()#[:10]

chapter
The Fortuneteller                          331
The Warriors of Kyoshi                     304
City of Walls and Secrets                  293
Jet                                        290
The Desert                                 286
                                          ... 
The Siege of the North, Part 2             161
Sozin's Comet, Part 3: Into the Inferno    151
The Blue Spirit                            144
Appa's Lost Days                           106
Sozin's Comet, Part 4: Avatar Aang          91
Name: count, Length: 61, dtype: int64

In [42]:
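# Alternative using only spoken words (dropna removes rows with any missing value):
#words = ("\n"+avatar.dropna().character.str.upper()+": "+avatar.dropna().character_words+" ").sum().split(' ')
# Note: Series.replace(' ','.') only replaces whole values equal to ' ', so it changes nothing here
# and "SCENE DESCRIPTION:" is split into two tokens; .str.replace would substitute within each string.
words = ("\n"+avatar.character.str.upper().replace(' ','.')+": "+avatar.full_text+" ").sum().split(' ')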

In [43]:
#from collections import defaultdict
word_used = dict()#defaultdict(int)
next_word = dict()#defaultdict(lambda: defaultdict(int))
for i,word in enumerate(words[:-1]):
    # count how often `word` occurs as the current word
    if word in word_used:
        word_used[word] += 1
    else: 
        word_used[word] = 1
        next_word[word] = {}

    # count how often words[i+1] directly follows `word`
    if words[i+1] in next_word[word]:
        next_word[word][words[i+1]] += 1 
    else:
        next_word[word][words[i+1]] = 1
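To see what the two dictionaries hold, the same counting idea can be run on a toy sentence. This is a sketch with an invented word list; the names `words_toy`, `word_used_toy` and `next_word_toy` are hypothetical, and `.get`/`.setdefault` are used here as a compact stand-in for the `if`/`else` branches.

```python
words_toy = "the cat saw the dog and the cat ran".split(" ")

word_used_toy = {}   # how often each word appears as the current word
next_word_toy = {}   # next_word_toy[w][v] = how often v directly follows w
for i, word in enumerate(words_toy[:-1]):
    word_used_toy[word] = word_used_toy.get(word, 0) + 1
    followers = next_word_toy.setdefault(word, {})
    followers[words_toy[i + 1]] = followers.get(words_toy[i + 1], 0) + 1

print(word_used_toy["the"])   # 3
print(next_word_toy["the"])   # {'cat': 2, 'dog': 1}

# Dividing follower counts by the word count estimates Pr(next word | current word).
probs = {v: n / word_used_toy["the"] for v, n in next_word_toy["the"].items()}
print(probs)
```

Because every counted occurrence of a word has exactly one follower, the estimated probabilities for each word sum to 1.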

In [45]:
word_used

{'\nKATARA:': 1437,
 'Water.': 3,
 'Earth.': 5,
 'Fire.': 6,
 'Air.': 3,
 'My': 143,
 'grandmother': 5,
 'used': 63,
 'to': 12764,
 'tell': 129,
 'me': 469,
 'stories': 13,
 'about': 439,
 'the': 18112,
 'old': 120,
 'days:': 1,
 'a': 7911,
 'time': 194,
 'of': 7711,
 'peace': 9,
 'when': 325,
 'Avatar': 410,
 'kept': 5,
 'balance': 25,
 'between': 111,
 'Water': 139,
 'Tribes,': 1,
 'Earth': 193,
 'Kingdom,': 7,
 'Fire': 753,
 'Nation': 368,
 'and': 8400,
 'Air': 70,
 'Nomads.': 4,
 'But': 334,
 'that': 1168,
 'all': 592,
 'changed': 18,
 'attacked.': 4,
 'Only': 26,
 'mastered': 11,
 'four': 94,
 'elements;': 1,
 'only': 244,
 'he': 1572,
 'could': 226,
 'stop': 140,
 'ruthless': 3,
 'firebenders.': 10,
 'world': 63,
 'needed': 15,
 'him': 976,
 'most,': 2,
 'vanished.': 3,
 'A': 399,
 'hundred': 60,
 'years': 78,
 'have': 802,
 'passed,': 4,
 'is': 2821,
 'nearing': 4,
 'victory': 6,
 'in': 3892,
 'war.': 20,
 'Two': 49,
 'ago,': 14,
 'my': 737,
 'father': 61,
 'men': 53,
 'tribe': 

In [46]:
next_word['Water.']

{'Earth.': 2, '[Shot': 1}

In [48]:
import numpy as np
from scipy import stats

In [49]:
current_word = "\nKatara:".upper()
print(current_word, end=' ')
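for i in range(100):
    # estimated Pr(next word | current word): follower counts divided by occurrences of the current word
    probability_of_next_word = np.array(list(next_word[current_word].values()))/word_used[current_word]
    # a single multinomial trial (n=1) returns a one-hot vector marking the sampled follower
    randomly_chosen_next_word = stats.multinomial(p=probability_of_next_word, n=1).rvs(size=1)[0,:]
    # select the follower whose one-hot entry equals 1
    current_word = np.array(list(next_word[current_word].keys()))[1==randomly_chosen_next_word][0]
    print(current_word, end=' ')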


KATARA: [Calling out.] 
TOPH: The scene in appreciation and turns and spreads her left arm, regaining a little words, this time. 
IROH: [Disinterested.] You think it's "boiled in the room.] Don't lie! You taught us anyway? [He looks extremely furious face. 
AANG: [Exclaims indignantly.] Hey! 
IROH: Oh that's enough. 
SCENE DESCRIPTION: Katara grabs Oyaji appears embarrassed Sokka slowly starts again, facing off the reward we'll find you! 
SCENE DESCRIPTION: The camera pans downward to the crew looks as Chan thoughtfully rubs her spirit want anyone but he comes across the courtyard. Cut to start over. 
AANG: I'm sorry, Katara. 
SOKKA: [Cut 
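The sampling step in the loop above draws a one-hot vector and then looks up the matching word. A possible alternative, sketched here under the assumption that drawing the label directly is acceptable, is NumPy's `choice`. The `followers` dictionary copies the `next_word['Water.']` output shown earlier; the generator `rng` and its seed are illustrative choices.

```python
import numpy as np

# Follower counts for one current word (same shape as next_word[w]).
followers = {"Earth.": 2, "[Shot": 1}
candidates = list(followers.keys())
counts = np.array(list(followers.values()))

# counts.sum() plays the role of word_used[w] in the loop above.
probabilities = counts / counts.sum()

# Draw many next words at once to check the sampling frequencies.
rng = np.random.default_rng(0)
sampled = rng.choice(candidates, size=1000, p=probabilities)
share = (sampled == "Earth.").mean()
print(share)  # should be close to 2/3
```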

In [50]:
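import re
# tag words inside [stage directions] by inserting '_' before each space, so they become tokens distinct from dialogue words
avatar.full_text = avatar.full_text.apply(lambda string: re.sub(r'\[.*?\]', lambda match: match.group(0).replace(' ', '_ '), string))
# likewise tag scene descriptions by inserting '-' before each space
avatar.loc[avatar.character=='Scene Description','full_text'] = avatar.full_text[avatar.character=='Scene Description'].str.replace(' ', '- ')
# .str.replace substitutes within each string, so "SCENE DESCRIPTION" becomes the single token "SCENE.DESCRIPTION:"
words = ("\n"+avatar.character.str.upper().str.replace(' ','.')+": "+avatar.full_text+" ").sum().split(' ')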
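The effect of the `re.sub` call above is easiest to see on a single line of dialogue. This sketch applies the same pattern and callback to one string taken from the data shown earlier; the variable names `line` and `marked` are introduced here for illustration.

```python
import re

line = "[Happily surprised.] Sokka, look!"

# r'\[.*?\]' matches the shortest span from '[' to the next ']'.
# The callback rewrites only the matched span, inserting '_' before each space.
marked = re.sub(r"\[.*?\]", lambda match: match.group(0).replace(" ", "_ "), line)
print(marked)              # [Happily_ surprised.] Sokka, look!

# After splitting on spaces, the tagged word is a different token
# from the same word spoken as dialogue.
print(marked.split(" "))   # ['[Happily_', 'surprised.]', 'Sokka,', 'look!']
```

The tags are removed again at print time with `.replace('_', '')` and `.replace('-', '')`.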

In [51]:
avatar[:10]

Unnamed: 0,id,book,book_num,chapter,chapter_num,character,full_text,character_words,writer,director,imdb_rating
0,1,Water,1,The Boy in the Iceberg,1,Katara,Water. Earth. Fire. Air. My grandmother used t...,Water. Earth. Fire. Air. My grandmother used t...,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
1,2,Water,1,The Boy in the Iceberg,1,Scene Description,"As- the- title- card- fades,- the- scene- open...",,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
2,3,Water,1,The Boy in the Iceberg,1,Sokka,It's not getting away from me this time. [Clos...,It's not getting away from me this time. Watc...,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
3,4,Water,1,The Boy in the Iceberg,1,Scene Description,The- shot- pans- quickly- from- the- boy- to- ...,,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
4,5,Water,1,The Boy in the Iceberg,1,Katara,"[Happily_ surprised.] Sokka, look!","Sokka, look!","‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
5,6,Water,1,The Boy in the Iceberg,1,Sokka,[Close-up_ of_ Sokka;_ whispering.] Sshh! Kata...,"Sshh! Katara, you're going to scare it away. ...","‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
6,7,Water,1,The Boy in the Iceberg,1,Scene Description,"Behind- Sokka,- Katara- is- still- making- cir...",,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
7,8,Water,1,The Boy in the Iceberg,1,Katara,[Struggling_ with_ the_ water_ that_ passes_ r...,"But, Sokka! I caught one!","‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
8,9,Water,1,The Boy in the Iceberg,1,Scene Description,The- bubble- containing- her- fish- slowly- dr...,,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1
9,10,Water,1,The Boy in the Iceberg,1,Katara,[Exclaims_ indignantly.] Hey!,Hey!,"‎Michael Dante DiMartino, Bryan Konietzko, Aar...",Dave Filoni,8.1


In [53]:
from collections import defaultdict
word_used2 = defaultdict(int)
next_word2 = defaultdict(lambda: defaultdict(int))
for i,word in enumerate(words[:-2]):
    word_used2[word+' '+words[i+1]] += 1
    next_word2[word+' '+words[i+1]][words[i+2]] += 1 
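A toy run of the same two-word counting shows how `defaultdict` removes the need for the `if`/`else` branches used in the one-word version. The token list and the names `pair_used` and `next_after_pair` are invented for this sketch.

```python
from collections import defaultdict

words_toy = "a b c a b d a b c".split(" ")

pair_used = defaultdict(int)                            # missing keys start at 0
next_after_pair = defaultdict(lambda: defaultdict(int)) # missing keys start as empty counters
for i in range(len(words_toy) - 2):
    pair = words_toy[i] + " " + words_toy[i + 1]
    pair_used[pair] += 1
    next_after_pair[pair][words_toy[i + 2]] += 1

print(pair_used["a b"])              # 3
print(dict(next_after_pair["a b"]))  # {'c': 2, 'd': 1}

# Caveat: merely looking up a missing key in a defaultdict creates it.
seen_before = "z z" in pair_used
value = pair_used["z z"]
seen_after = "z z" in pair_used
print(seen_before, value, seen_after)  # False 0 True
```

The caveat matters during text generation: indexing an unseen pair silently returns an empty counter rather than raising a `KeyError`.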

In [None]:
next_word2

In [54]:
current_word_1 = "\nKatara:".upper()
current_word_2 = "Water."
print(current_word_1, end=' ')
print(current_word_2, end=' ')
for i in range(100):
    probability_of_next_word = np.array(list(next_word2[current_word_1+' '+current_word_2].values()))/word_used2[current_word_1+' '+current_word_2]
    randomly_chosen_next_word = stats.multinomial(p=probability_of_next_word, n=1).rvs(size=1)[0,:]
    current_word_1,current_word_2 = current_word_2,np.array(list(next_word2[current_word_1+' '+current_word_2].keys()))[1==randomly_chosen_next_word][0]
    print(current_word_2.replace('_', '').replace('-', ''), end=' ')


KATARA: Water. Earth. Fire. Air. Long ago, the four elements talk is sounding like Avatar stuff. 
IROH: It was an assassin, Sokka. 
SCENE.DESCRIPTION: Aang lands and falls backward onto the ground. 
KATARA: [Nervously.] I've changed my mind. [Desperately.] Please, Uncle, I'm sorry. I didn't mean to her. 
SCENE.DESCRIPTION: Sokka takes out a huge splash that soaks and almost knock his cart over. Aang sends a stream of air, which Jet ducks. Jet attacks Aang. Cut to the ground, and starts to lower his arms, turns around, showing his control of their bowls. 
KATARA: [Delighted.] You're a horrible person, and the other warriors 

In [None]:
word_used3 = defaultdict(int)
next_word3 = defaultdict(lambda: defaultdict(int))
for i,word in enumerate(words[:-3]):
    word_used3[word+' '+words[i+1]+' '+words[i+2]] += 1
    next_word3[word+' '+words[i+1]+' '+words[i+2]][words[i+3]] += 1 
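The one-, two- and three-word versions differ only in how many preceding words form the context, so the counting can be written once for any context length. The function `count_contexts` below is a hypothetical helper, not part of the notebook, and it uses tuples as dictionary keys instead of space-joined strings.

```python
from collections import defaultdict

def count_contexts(tokens, order):
    """Count how often each token follows each run of `order` tokens."""
    context_used = defaultdict(int)
    next_token = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])   # e.g. ('a', 'b', 'c') for order=3
        context_used[context] += 1
        next_token[context][tokens[i + order]] += 1
    return context_used, next_token

tokens = "a b c a b d a b c d".split(" ")

used1, nxt1 = count_contexts(tokens, 1)   # one-word context
used3, nxt3 = count_contexts(tokens, 3)   # three-word context

print(dict(nxt1[("b",)]))            # {'c': 2, 'd': 1}
print(dict(nxt3[("a", "b", "c")]))   # {'a': 1, 'd': 1}
```

Longer contexts produce more distinct keys with fewer observations each, which is why generated text follows the source more closely as the context grows.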

In [None]:
current_word_1 = "\nKatara:".upper()
current_word_2 = "Water."
current_word_3 = "Earth."
print(current_word_1, end=' ')
print(current_word_2, end=' ')
print(current_word_3, end=' ')
for i in range(100):
    probability_of_next_word = np.array(list(next_word3[current_word_1+' '+current_word_2+' '+current_word_3].values()))/word_used3[current_word_1+' '+current_word_2+' '+current_word_3]
    randomly_chosen_next_word = stats.multinomial(p=probability_of_next_word, n=1).rvs(size=1)[0,:]
    current_word_1,current_word_2,current_word_3 = current_word_2,current_word_3,np.array(list(next_word3[current_word_1+' '+current_word_2+' '+current_word_3].keys()))[1==randomly_chosen_next_word][0]
    print(current_word_3.replace('_', '').replace('-', ''), end=' ')

In [None]:
from collections import Counter, defaultdict
characters = Counter("\n"+avatar.character.str.upper().str.replace(' ','.')+":")

nested_dict = lambda: defaultdict(nested_dict)
word_used2C = nested_dict()
next_word2C = nested_dict()

for i,word in enumerate(words[:-2]):
    
    if word in characters:
        character = word
        
    if character not in word_used2C:
        word_used2C[character] = dict()
    if word+' '+words[i+1] not in word_used2C[character]:
        word_used2C[character][word+' '+words[i+1]] = 0
    word_used2C[character][word+' '+words[i+1]] += 1

    if character not in next_word2C:
        next_word2C[character] = dict()
    if word+' '+words[i+1] not in next_word2C[character]:
        next_word2C[character][word+' '+words[i+1]] = dict()
    if words[i+2] not in next_word2C[character][word+' '+words[i+1]]:
        next_word2C[character][word+' '+words[i+1]][words[i+2]] = 0
    next_word2C[character][word+' '+words[i+1]][words[i+2]] += 1
        
        

In [None]:
current_word_1 = "\nKatara:".upper()
current_word_2 = "Water."
print(current_word_1, end=' ')
print(current_word_2, end=' ')
for i in range(100):
    if current_word_1 in characters:
        character = current_word_1

    probability_of_next_word = np.array(list(next_word2C[character][current_word_1+' '+current_word_2].values()))/word_used2C[character][current_word_1+' '+current_word_2]
    randomly_chosen_next_word = stats.multinomial(p=probability_of_next_word, n=1).rvs(size=1)[0,:]
    current_word_1,current_word_2 = current_word_2,np.array(list(next_word2C[character][current_word_1+' '+current_word_2].keys()))[1==randomly_chosen_next_word][0]
    print(current_word_2.replace('_', '').replace('-', ''), end=' ')