# Rick and Morty Sentiment analysis
### Author: Devin McCormack
#### Initiated Mar/13/18

Using transcripts of all three seasons of Rick and Morty scraped from the [Rickipedia](http://rickandmorty.wikia.com/wiki/Rickipedia) wiki, I will analyze the main characters: how much they talk, how complex their vocabulary is (including made-up words), who they talk to, and the sentiment of their conversations.

## Gather

In [129]:
from bs4 import BeautifulSoup
import pandas
import glob
import os
import requests
import csv

seasonurls=['http://rickandmorty.wikia.com/wiki/Category:Season_1_transcripts',
            'http://rickandmorty.wikia.com/wiki/Category:Season_2_transcripts',
            'http://rickandmorty.wikia.com/wiki/Category:Season_3_transcripts']



In [17]:
# Derive the local file name from a season URL (here, the last one in the list)
seasonurls[-1].split('/')[-1].split(':')[-1]

'Season_3_transcripts'

#### Pull season pages (to crawl for episode transcripts)

In [18]:
folder_name='seasonlinks'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)


for url in seasonurls:
    response=requests.get(url)
    with open(os.path.join(folder_name,
                           url.split('/')[-1].split(':')[-1]),mode='wb') as file:
        file.write(response.content)

#### Crawl for transcripts

In [48]:
os.path.join(folder_name,'Season_1_transcripts')

'seasonlinks/Season_1_transcripts'

In [73]:
home='http://rickandmorty.wikia.com'
transcripts=[]
with open(os.path.join(folder_name,'Season_1_transcripts')) as file:
    soup = BeautifulSoup(file,'lxml')
    link=soup.find_all('div',{'id':"mw-pages"})
    for trans in link[0].find_all('a'):
        transcripts.append(home+trans.get('href'))
        
print(transcripts)

['http://rickandmorty.wikia.com/wiki/Anatomy_Park_(episode)/Transcript', 'http://rickandmorty.wikia.com/wiki/Close_Rick-counters_of_the_Rick_Kind/Transcript', 'http://rickandmorty.wikia.com/wiki/Lawnmower_Dog/Transcript', 'http://rickandmorty.wikia.com/wiki/M._Night_Shaym-Aliens!/Transcript', 'http://rickandmorty.wikia.com/wiki/Meeseeks_and_Destroy/Transcript', 'http://rickandmorty.wikia.com/wiki/Pilot/Transcript', 'http://rickandmorty.wikia.com/wiki/Raising_Gazorpazorp/Transcript', 'http://rickandmorty.wikia.com/wiki/Rick_Potion_9/Transcript', 'http://rickandmorty.wikia.com/wiki/Ricksy_Business/Transcript', 'http://rickandmorty.wikia.com/wiki/Rixty_Minutes/Transcript', 'http://rickandmorty.wikia.com/wiki/Something_Ricked_This_Way_Comes/Transcript']


In [77]:
transcripts[0].split('/')[-2]

'Anatomy_Park_(episode)'
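
The name sliced out of the URL is still URL-encoded, which is why file names such as `Look_Who%27s_Purging_Now` appear further down. A minimal sketch of decoding it into a readable title, using only the standard library's `urllib.parse.unquote`; the helper name `episode_title` is hypothetical and not part of the original notebook:

```python
from urllib.parse import unquote

def episode_title(url):
    """Return a human-readable episode title from a transcript URL.

    Hypothetical helper for illustration only.
    """
    slug = url.split('/')[-2]               # same slice the notebook uses
    return unquote(slug).replace('_', ' ')  # decode %27 etc., then drop underscores

print(episode_title('http://rickandmorty.wikia.com/wiki/Look_Who%27s_Purging_Now/Transcript'))
# prints: Look Who's Purging Now
```

Keeping the encoded slug for file names and decoding only for display avoids putting apostrophes into paths.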

In [74]:
seasonfolder=['Season_1','Season_2','Season_3']
for fold in seasonfolder:
    foldtemp=os.path.join(folder_name,fold)
    if not os.path.exists(foldtemp):
        os.makedirs(foldtemp)

In [80]:
folder_name

'seasonlinks'

#### Isolate links to transcripts

In [146]:
transcripts=[]
with open(os.path.join(folder_name,'Season_1_transcripts')) as file:
    soup = BeautifulSoup(file,'lxml')
    link=soup.find_all('div',{'id':"mw-pages"})
    for trans in link[0].find_all('a'):
        transcripts.append(home+trans.get('href'))
        
print(transcripts)

['http://rickandmorty.wikia.com/wiki/Anatomy_Park_(episode)/Transcript', 'http://rickandmorty.wikia.com/wiki/Close_Rick-counters_of_the_Rick_Kind/Transcript', 'http://rickandmorty.wikia.com/wiki/Lawnmower_Dog/Transcript', 'http://rickandmorty.wikia.com/wiki/M._Night_Shaym-Aliens!/Transcript', 'http://rickandmorty.wikia.com/wiki/Meeseeks_and_Destroy/Transcript', 'http://rickandmorty.wikia.com/wiki/Pilot/Transcript', 'http://rickandmorty.wikia.com/wiki/Raising_Gazorpazorp/Transcript', 'http://rickandmorty.wikia.com/wiki/Rick_Potion_9/Transcript', 'http://rickandmorty.wikia.com/wiki/Ricksy_Business/Transcript', 'http://rickandmorty.wikia.com/wiki/Rixty_Minutes/Transcript', 'http://rickandmorty.wikia.com/wiki/Something_Ricked_This_Way_Comes/Transcript']


#### Scrape transcripts from website

In [162]:
home='http://rickandmorty.wikia.com'
for season in seasonfolder:
    transcripts=[]
    with open(os.path.join(folder_name,season+'_transcripts')) as file:
        soup = BeautifulSoup(file,'lxml')
        link=soup.find_all('div',{'id':"mw-pages"})
        for trans in link[0].find_all('a'):
            transcripts.append(home+trans.get('href'))
    for url in transcripts:
        response=requests.get(url)
        print(season)
        with open(os.path.join(folder_name,season,
                               url.split('/')[-2]+'.txt'),mode='wb') as file:
            file.write(response.content)
    


Season_1
Season_1
Season_1
Season_1
Season_1
Season_1
Season_1
Season_1
Season_1
Season_1
Season_1
Season_2
Season_2
Season_2
Season_2
Season_2
Season_2
Season_2
Season_2
Season_2
Season_2
Season_3
Season_3
Season_3
Season_3
Season_3
Season_3
Season_3
Season_3
Season_3
Season_3


#### Isolate text from files

In [163]:
with open ('seasonlinks/Season_1/Close_Rick-counters_of_the_Rick_Kind.txt',encoding='utf-8') as file:
    soup=BeautifulSoup(file,'lxml')
    print(soup.find('div',{'id':"mw-content-text"}).find_all('p')[0:2])

[<p>This article is a <b>transcript</b> of the <a class="text" href="http://rickandmorty.wikia.com/wiki/Rick_and_Morty">Rick and Morty</a> episode "<a href="/wiki/Close_Rick-counters_of_the_Rick_Kind" title="Close Rick-counters of the Rick Kind">Close Rick-counters of the Rick Kind</a>" from season 1, which aired on April 7, 2014.
</p>, <p dir="ltr" id="docs-internal-guid-c77db3c5-dc8a-d408-2c5b-1ed32175c975" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial;color:#000000;font-weight:400;font-style:italic;font-variant:normal;text-decoration:none;vertical-align:baseline;">[The</span><a class="text" href="http://rickandmorty.wikia.com/wiki/Smith_Family"><span style="font-size:11pt;font-family:Arial;color:#000000;font-weight:400;font-style:italic;font-variant:normal;text-decoration:none;vertical-align:baseline;"> </span></a><span style="font-size:11pt;font-family:Arial;color:#000000;font-weight:400;font-style:italic;font-variant:norma

#### Build nested loop: single file

In [166]:
script=[]
sentencecount=0
with open ('seasonlinks/Season_1/Close_Rick-counters_of_the_Rick_Kind.txt',encoding='utf-8') as file:
    soup=BeautifulSoup(file,'lxml')
    for line in soup.find('div',{'id':"mw-content-text"}).find_all('p'):
        sentencecount+=1
        script.append([str(sentencecount),'1','Close_Rick-counters_of_the_Rick_Kind',line.getText()])



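# newline='' is what the csv module expects; utf-8 keeps non-ASCII characters intact
with open("Close_Rick-counters_of_the_Rick_Kind_test.csv",'w',newline='',encoding='utf-8') as file:
    wr = csv.writer(file, delimiter=',')
    wr.writerow(['SentenceID','Season','Episode','Sentence'])
    wr.writerows(script)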

#### Build the nested loop, single season

In [174]:
season=os.path.join(folder_name,seasonfolder[0])
seasoncount=1
for transcript in glob.glob(os.path.join(season,'*.txt')):
    sentencecount=0
    script=[]
    # basename/splitext works with either path separator
    epname=os.path.splitext(os.path.basename(transcript))[0]
    with open (transcript,encoding='utf-8') as file:
        soup=BeautifulSoup(file,'lxml')
        for line in soup.find('div',{'id':"mw-content-text"}).find_all('p'):
            sentencecount+=1
            script.append([str(sentencecount),str(seasoncount),epname,line.getText()])

       
    writename=transcript[:-4]+'.csv'
    with open(writename,'w',newline='',encoding='utf-8') as file:
        wr = csv.writer(file, delimiter=',')
        wr.writerow(['SentenceID','Season','Episode','Sentence'])
        wr.writerows(script)

#### All seasons

In [175]:
# enumerate from 1 so every row carries its real season number
for seasoncount,season in enumerate(seasonfolder,start=1):
    filename=os.path.join(folder_name,season)
    for transcript in glob.glob(os.path.join(filename,'*.txt')):
        sentencecount=0
        script=[]
        epname=os.path.splitext(os.path.basename(transcript))[0]
        with open (transcript,encoding='utf-8') as file:
            soup=BeautifulSoup(file,'lxml')
            for line in soup.find('div',{'id':"mw-content-text"}).find_all('p'):
                sentencecount+=1
                script.append([str(sentencecount),str(seasoncount),epname,line.getText()])

        writename=transcript[:-4]+'.csv'
        with open(writename,'w',newline='',encoding='utf-8') as file:
            wr = csv.writer(file, delimiter=',')
            wr.writerow(['SentenceID','Season','Episode','Sentence'])
            wr.writerows(script)

In [171]:
season=os.path.join(folder_name,seasonfolder[0])
glob.glob(os.path.join(season,'*'))[0].split('/')[-1][:-4]

'Something_Ricked_This_Way_Comes'

### Assess gathering
This works well, but inspecting the souped text shows that several episodes use a different HTML format from the one above. Also worth noting: at the time of the initial commit/scrape, 5 episodes of season 3 had no transcript. Here are the transcripts with issues:

1. Season 1
    - Lawnmower_Dog (one line)
    - Pilot (one line)
    - Raising_Gazorpazorp (nothing extracted)
    - Ricksy_Business (two lines)
    - Something_Ricked_This_Way_Comes (two lines)
2. Season 2
    - Big_Trouble_in_Little_Sanchez *(No transcript)*
    - Look_Who%27s_Purging_Now (three lines)
3. Season 3
    - Mort%27s_Mind_Blowers *(No transcript)*
    - Pickle_Rick (one line)
    - Rest_and_Ricklaxation (nothing extracted, ambiguous transcript)
    - Rickmancing_the_Stone (one line)
    - The_ABC%27s_of_Beth *(No transcript)*
    - The_Rickchurian_Mortydate *(No transcript)*
    - The_Rickshank_Rickdemption (two lines)
    - The_Whirly_Dirly_Conspiracy *(No transcript)*
    - Vindicators_3/_The_Return_of_Worldender *(No transcript)*
  
  
Season 3 clearly has some troubles and missing data, but the issues with season 1 & 2 can be fixed with different parsing/extraction. We can try a new extraction on all problem transcripts, and then iterate as we solve various problems.
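
One way to find the problem transcripts without opening each file is to count the rows in every CSV written above and flag the short ones. The sketch below assumes the `seasonlinks/Season_N/*.csv` layout used in this notebook. The function name `short_transcripts`, the cutoff of 5 rows, and the raised field-size limit are hypothetical choices, not values from the original analysis:

```python
import csv
import glob
import os

# A one-line transcript can put a whole episode in one field, so raise the
# csv module's default field size limit (assumed value, adjust as needed).
csv.field_size_limit(10**7)

def short_transcripts(root='seasonlinks', min_rows=5):
    """Return (path, row_count) for CSVs with fewer than min_rows data rows."""
    flagged = []
    for path in sorted(glob.glob(os.path.join(root, 'Season_*', '*.csv'))):
        with open(path, newline='', encoding='utf-8') as f:
            # csv.reader counts records, so newlines inside a field do not inflate the total
            rows = max(sum(1 for _ in csv.reader(f)) - 1, 0)   # minus the header row
        if rows < min_rows:
            flagged.append((path, rows))
    return flagged

for path, rows in short_transcripts():
    print(rows, path)
```

Episodes with no transcript at all would still need to be checked by hand, since the wiki page exists and may yield a few boilerplate rows.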

### Cleanup Gather (iteration 1)

In [425]:
troubleextract1=['Season_1/Lawnmower_Dog',
                 'Season_1/Pilot',
                 'Season_1/Raising_Gazorpazorp',
                 'Season_1/Ricksy_Business',
                 'Season_1/Something_Ricked_This_Way_Comes',
                 'Season_2/Look_Who%27s_Purging_Now',
                 'Season_3/Pickle_Rick',
                 'Season_3/Rest_and_Ricklaxation',
                 'Season_3/Rickmancing_the_Stone',
                 'Season_3/The_Rickshank_Rickdemption']
troubleextract1=['seasonlinks/'+ s +'.txt' for s in troubleextract1]
print(troubleextract1)

['seasonlinks/Season_1/Lawnmower_Dog.txt', 'seasonlinks/Season_1/Pilot.txt', 'seasonlinks/Season_1/Raising_Gazorpazorp.txt', 'seasonlinks/Season_1/Ricksy_Business.txt', 'seasonlinks/Season_1/Something_Ricked_This_Way_Comes.txt', 'seasonlinks/Season_2/Look_Who%27s_Purging_Now.txt', 'seasonlinks/Season_3/Pickle_Rick.txt', 'seasonlinks/Season_3/Rest_and_Ricklaxation.txt', 'seasonlinks/Season_3/Rickmancing_the_Stone.txt', 'seasonlinks/Season_3/The_Rickshank_Rickdemption.txt']


In [209]:
transcript=troubleextract1[0]
with open (transcript,encoding='utf-8') as file:
    soup=BeautifulSoup(file,'lxml')
    print(soup.find('div',{'id':"mw-content-text"}).find_all('p')[0])

<p><i>(<a class="mw-redirect" href="/wiki/Jerry" title="Jerry">Jerry</a> and <a class="mw-redirect" href="/wiki/Summer" title="Summer">Summer</a> are in the living room. Jerry is flipping through channels on TV and Summer is texting)</i><br/>
<b>TV:</b> Coin collecting is considered the perfect hobby.<br/>
beautiful putt right there good birdie.<br/>
That's only the eighth birdie of the day.<br/>
<i>(<a href="/wiki/Snuffles" title="Snuffles">Snuffles</a> walks up to Jerry are sits there, looking at him)</i><br/>
<b>Jerry:</b> What? Why are you looking at me? You want to go outside? Outside? <i>(Sigh)</i><br/>
<i>(Jerry opens the door to let Snuffles out but he still just stands there)<br/></i>
<b>Jerry:</b> Outside?<br/>
<i>(Snuffles pees on the carpet)</i><br/>
<b>Jerry:</b> Are you kidding me?! Come on!<br/>
<b>Summer:</b> Oh, my God.<br/>
<i>(<a class="mw-redirect" href="/wiki/Morty" title="Morty">Morty</a> hears his dad yelling and runs into the room to check up on him)</i><br/>
<b

Everything is in one 'p' tagged element, but it looks like we might be able to split it on the line breaks (`<br/>`).

In [411]:
transcript=troubleextract1[0]
with open (transcript,encoding='utf-8') as file:
    soup=BeautifulSoup(file,'lxml')
    chunk=str(soup.find('div',{'id':"mw-content-text"}).find('p'))
    chunk=chunk.split('<br/>')
#     print(chunk[6])
    for line in chunk:
        print(line,'end')


<p><i>(<a class="mw-redirect" href="/wiki/Jerry" title="Jerry">Jerry</a> and <a class="mw-redirect" href="/wiki/Summer" title="Summer">Summer</a> are in the living room. Jerry is flipping through channels on TV and Summer is texting)</i> end

<b>TV:</b> Coin collecting is considered the perfect hobby. end

beautiful putt right there good birdie. end

That's only the eighth birdie of the day. end

<i>(<a href="/wiki/Snuffles" title="Snuffles">Snuffles</a> walks up to Jerry are sits there, looking at him)</i> end

<b>Jerry:</b> What? Why are you looking at me? You want to go outside? Outside? <i>(Sigh)</i> end

<i>(Jerry opens the door to let Snuffles out but he still just stands there) end
</i>
<b>Jerry:</b> Outside? end

<i>(Snuffles pees on the carpet)</i> end

<b>Jerry:</b> Are you kidding me?! Come on! end

<b>Summer:</b> Oh, my God. end

<i>(<a class="mw-redirect" href="/wiki/Morty" title="Morty">Morty</a> hears his dad yelling and runs into the room to check up on him)</i> end

<b

In [220]:
transcript=troubleextract1[0]
with open (transcript,encoding='utf-8') as file:
    soup=BeautifulSoup(file,'lxml')
    chunk=soup.find('div',{'id':"mw-content-text"}).find('p')
    sentencecount=0
    script=[]
    writename=transcript[:-4]+'.csv'
    epname=transcript.split('/')[2][:-4]
    seasoncount=transcript.split('/')[1][-1]
    
    for line in chunk.children:  # .children replaces the deprecated childGenerator()
        sentencecount+=1
        script.append([str(sentencecount),str(seasoncount),epname,str(line)])


    with open(writename,'w',newline='',encoding='utf-8') as file:
        wr = csv.writer(file, delimiter=',')
        wr.writerow(['SentenceID','Season','Episode','Sentence'])
        wr.writerows(script)


Didn't work: iterating over the element's children puts every child node (each tag, text run, and line break) on its own row.

In [412]:
transcript=troubleextract1[0]
with open (transcript,encoding='utf-8') as file:
    soup=BeautifulSoup(file,'lxml')
    chunk=str(soup.find('div',{'id':"mw-content-text"}).find('p'))
    sentencecount=0
    script=[]
    writename=transcript[:-4]+'.csv'
    epname=transcript.split('/')[2][:-4]
    seasoncount=transcript.split('/')[1][-1]
    
    chunk=chunk.split('<br/>')
    for line in chunk:
        sentencecount+=1
        script.append([str(sentencecount),str(seasoncount),epname,line])



    with open(writename,'w',newline='',encoding='utf-8') as file:
        wr = csv.writer(file, delimiter=',')
        wr.writerow(['SentenceID','Season','Episode','Sentence'])
        wr.writerows(script)

Seems to work. I am cautiously optimistic that this will fix all the one-line issues.
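
Splitting on the exact string `<br/>` depends on how the break happens to be serialized, and each piece still carries markup such as `<b>Jerry:</b>`. A rough sketch of a more tolerant splitter that also strips the tags, using only the standard library. `split_lines` is a hypothetical helper, and the regex-based tag stripping is a simplification that assumes well-formed markup like the samples above; it is not a general HTML parser:

```python
import html
import re

BR = re.compile(r'<br\s*/?>', re.IGNORECASE)   # matches <br>, <br/>, <br />
TAG = re.compile(r'<[^>]+>')                   # any remaining tag

def split_lines(chunk):
    """Split an HTML fragment on line breaks and return plain-text lines."""
    lines = []
    for piece in BR.split(chunk):
        text = html.unescape(TAG.sub('', piece)).strip()
        if text:                               # skip pieces that were only markup
            lines.append(text)
    return lines

sample = ('<p><i>(Snuffles pees on the carpet)</i><br/>\n'
          '<b>Jerry:</b> Are you kidding me?! Come on!<br />\n'
          '<b>Summer:</b> Oh, my God.<br></p>')
print(split_lines(sample))
# ['(Snuffles pees on the carpet)', 'Jerry: Are you kidding me?! Come on!', 'Summer: Oh, my God.']
```

Stripping tags at this stage loses the bold speaker markup, so if the speaker needs to be recovered reliably it may be better to keep the raw HTML in the CSV, as the notebook does, and clean it later.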

##### Apply to all troubled transcripts and reassess

In [415]:
for transcript in troubleextract1:
    with open (transcript,encoding='utf-8') as file:
        soup=BeautifulSoup(file,'lxml')
        chunk=str(soup.find('div',{'id':"mw-content-text"}).find('p'))
        sentencecount=0
        script=[]
        writename=transcript[:-4]+'.csv'
        epname=transcript.split('/')[2][:-4]
        seasoncount=transcript.split('/')[1][-1]

        chunk=chunk.split('<br/>')
        for line in chunk:
            sentencecount+=1
            script.append([str(sentencecount),str(seasoncount),epname,line])



        with open(writename,'w',newline='',encoding='utf-8') as file:
            wr = csv.writer(file, delimiter=',')
            wr.writerow(['SentenceID','Season','Episode','Sentence'])
            wr.writerows(script)

Moderate success! This fixed most of the one liners. Reassessment:

1. Season 1
    - Lawnmower_Dog (one line) **FIXED**
    - Pilot (one line) **FIXED**
    - Raising_Gazorpazorp (nothing extracted)
    - Ricksy_Business (two lines) (now single line)
    - Something_Ricked_This_Way_Comes (two lines) (now single line)
2. Season 2
    - Big_Trouble_in_Little_Sanchez *(No transcript)*
    - Look_Who%27s_Purging_Now (three lines) (now single line)
3. Season 3
    - Mort%27s_Mind_Blowers *(No transcript)*
    - Pickle_Rick (one line) **FIXED**
    - Rest_and_Ricklaxation (nothing extracted, ambiguous transcript)
    - Rickmancing_the_Stone (one line) **FIXED**
    - The_ABC%27s_of_Beth *(No transcript)*
    - The_Rickchurian_Mortydate *(No transcript)*
    - The_Rickshank_Rickdemption (two lines)
    - The_Whirly_Dirly_Conspiracy *(No transcript)*
    - Vindicators_3/_The_Return_of_Worldender *(No transcript)*
    
 
  

**REVISED** problem data 

1. Season 1
    - Raising_Gazorpazorp (nothing extracted)
    - Ricksy_Business (two lines) (now single line)
    - Something_Ricked_This_Way_Comes (two lines) (now single line)    
2. Season 2
    - Big_Trouble_in_Little_Sanchez (one line)
    - Look_Who%27s_Purging_Now (three lines) (now single line)    
3. Season 3
    - Rest_and_Ricklaxation (nothing extracted, ambiguous transcript)
    - The_Rickshank_Rickdemption (two lines)

Looking closely at these transcripts, only one more of them is really an extraction error (The_Rickshank_Rickdemption). The rest are incomplete transcripts. Including those in the dataset will need fairly serious manual correction, but we may already have a decent enough set to try a few things.

### Cleanup Gather (iteration 2)

In [419]:
troubleextract2=['Season_1/Raising_Gazorpazorp',
                 'Season_1/Ricksy_Business',
                 'Season_1/Something_Ricked_This_Way_Comes',
                 'Season_2/Big_Trouble_in_Little_Sanchez',
                 'Season_2/Look_Who%27s_Purging_Now',
                 'Season_3/Rest_and_Ricklaxation',
                 'Season_3/The_Rickshank_Rickdemption']
troubleextract2=['seasonlinks/'+ s +'.txt' for s in troubleextract2]
print(troubleextract2)

['seasonlinks/Season_1/Raising_Gazorpazorp.txt', 'seasonlinks/Season_1/Ricksy_Business.txt', 'seasonlinks/Season_1/Something_Ricked_This_Way_Comes.txt', 'seasonlinks/Season_2/Big_Trouble_in_Little_Sanchez.txt', 'seasonlinks/Season_2/Look_Who%27s_Purging_Now.txt', 'seasonlinks/Season_3/Rest_and_Ricklaxation.txt', 'seasonlinks/Season_3/The_Rickshank_Rickdemption.txt']


In [453]:
transcript=troubleextract2[-1]
with open (transcript,encoding='utf-8') as file:
    soup=BeautifulSoup(file,'lxml')
    chunk=soup.find('div',{'id':"mw-content-text"}).find_all('p')
    chunk=str(chunk[-1])
    chunk=chunk.split('<br/>')
    for line in chunk:
        print(line,'end')

<p><b><a class="mw-redirect" href="/wiki/Rick" title="Rick">Rick</a>:</b> Anyway, that's how I escaped from space prison! Oh, scary place. end

<b><a class="mw-redirect" href="/wiki/Morty" title="Morty">Morty</a>:</b> Wow, Rick! That's -- That's one -- one heck of a story. I sure do wish I could have been there to see it happen. end

<b>Rick:</b> Oh, come on. Who wants to watch a mad scientist use handmade sci-fi tools to take out highly trained alien guards when we can sit here and be a family at <a href="/wiki/Shoney%27s" title="Shoney's">Shoney's</a>?  end

<b><a class="mw-redirect" href="/wiki/Beth" title="Beth">Beth</a>:</b> Dad, it's great to have you back, no matter where we are, but wouldn't you like to go home? end

<b>Rick:</b> Emotionally speaking, honey, Shoney's is my home. end

<b><a class="mw-redirect" href="/wiki/Jerry" title="Jerry">Jerry</a>:</b> Yeah, but you just got out of prison. I mean, how much of a step up from that is --  end

<b>Rick:</b> Jerry, get out of th

In [454]:
# for transcript in troubleextract2:
transcript=troubleextract2[-1]

with open (transcript,encoding='utf-8') as file:
    soup=BeautifulSoup(file,'lxml')
    chunk=soup.find('div',{'id':"mw-content-text"}).find_all('p')
    chunk=str(chunk[-1])
    sentencecount=0
    script=[]
    writename=transcript[:-4]+'.csv'
    epname=transcript.split('/')[2][:-4]
    seasoncount=transcript.split('/')[1][-1]

    chunk=chunk.split('<br/>')
    for line in chunk:
        sentencecount+=1
        script.append([str(sentencecount),str(seasoncount),epname,line])



    with open(writename,'w',newline='',encoding='utf-8') as file:
        wr = csv.writer(file, delimiter=',')
        wr.writerow(['SentenceID','Season','Episode','Sentence'])
        wr.writerows(script)

## Assess

## Clean

#### Define

#### Code

#### Test