---
title: GOING DOWN TO SOUTH PARK
subtitle: to make some tidytext analysis
author: PATRIK DRHLÍK
job: freelance data scientist
framework: io2012
highlighter: highlight.js
hitheme: tomorrow
widgets:
mode: selfcontained
knit: slidify::knit2slides
logo: boys.png
---

Web scraping and R packages

[South Park episode transcripts](https://southpark.wikia.com/wiki/Portal:Scripts)

[IMDB South Park episode ratings](https://www.imdb.com/title/tt0121955/episodes)

Main R packages: tidyverse, tidytext, southparkr


Glimpse at the data

## Observations: 312,767
## Variables: 16
## $ season                <fct> Season Thirteen, Season One, Season Sixt...
## $ season_number         <int> 13, 1, 16, 5, 15, 3, 11, 9, 13, 3, 21, 8...
## $ season_episode_number <dbl> 7, 10, 14, 10, 10, 4, 1, 2, 5, 15, 6, 1,...
## $ episode               <fct> Fatbeard, Damien, Obama Wins!, How to Ea...
## $ episode_number        <int> 188, 10, 237, 75, 219, 35, 154, 127, 186...
## $ character             <chr> "cartman", "stan", "kyle", "jonesy", "mr...
## $ year                  <int> 2009, 1997, 2012, 2001, 2011, 1999, 2007...
## $ line_number           <int> 63817, 4528, 76451, 31001, 72011, 16346,...
## $ word                  <chr> "hey", "bubye", "program", "wrong", "gen...
## $ word_stem             <chr> "hei", "buby", "program", "wrong", "gene...
## $ swear_word            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ episode_name          <chr> "Fatbeard", "Damien", "Obama Wins!", "Ho...
## $ air_date              <date> 2009-04-22, 1998-02-04, 2012-11-07, 200...
## $ user_rating           <dbl> 8.2, 8.1, 7.5, 8.1, 7.6, 6.7, 8.8, 8.8, ...
## $ user_votes            <dbl> 1578, 1703, 1156, 1488, 1229, 1546, 2356...
## $ score                 <int> NA, NA, NA, -2, 2, NA, NA, NA, NA, NA, N...
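
The columns above describe a table with one word per row. A minimal sketch of how such a table could be built with tidytext follows. The `lines` data frame and the `swear_list` vector are hypothetical toy inputs, and using `SnowballC::wordStem` for `word_stem` is an assumption (it is consistent with "hey" becoming "hei"); the actual southparkr pipeline may differ.

```r
library(dplyr)
library(tidytext)

# Toy transcript lines (hypothetical, not the real data).
lines <- tibble(
  character   = c("cartman", "stan"),
  line_number = 1:2,
  text        = c("Hey, this is the wrong program!", "Bubye, you damn fool.")
)

# Hypothetical swear-word list, for illustration only.
swear_list <- c("damn")

words <- lines %>%
  unnest_tokens(word, text) %>%            # one lowercased word per row
  anti_join(stop_words, by = "word") %>%   # drop stopwords (a, the, this, ...)
  mutate(
    word_stem  = SnowballC::wordStem(word),  # assumed stemmer (Porter)
    swear_word = word %in% swear_list        # flag words on the toy list
  )

glimpse(words)
```

Every other column (`character`, `line_number`) is carried along by `unnest_tokens`, which is what makes per-character and per-episode summaries possible later.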

Basic statistics about the show

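| Figure | Description |
|---|---|
| 21 | Number of seasons |
| 287 | Number of episodes |
| 914 475 | Number of words |
| 312 767 | Number of words without stopwords (a, the, this, ...) |
| 6 170 | Number of swear words |
| 1.97 % | Share of swear words |
| 34.2 % | Share of words used for analysis |
| 4 403 | Number of characters |
| 8.14 | Mean IMDB rating |
| 9.6 | Highest rating: Scott Tenorman Must Die (S05E04) |
| 6.3 | Lowest rating: A Million Little Fibers (S10E05) |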
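
The two percentages can be reproduced from the counts in the table. The sketch below assumes that the swear-word share is taken relative to the words left after stopword removal; the slide does not state the denominator, but the numbers only match under that reading.

```r
# Counts taken from the table above.
n_words   <- 914475  # all words
n_no_stop <- 312767  # words left after removing stopwords
n_swear   <- 6170    # swear words

# Share of swear words among the non-stopword words (assumed denominator).
pct_swear <- round(100 * n_swear / n_no_stop, 2)

# Share of all words that survive stopword removal and enter the analysis.
pct_used <- round(100 * n_no_stop / n_words, 1)

cat(pct_swear, "% swear words;", pct_used, "% used for analysis\n")
```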


Overall sentiment analysis

plot of chunk unnamed-chunk-4
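
The `score` column in the data (integer values such as -2 and 2, `NA` for most words) looks like a word-level score from a sentiment lexicon. That reading is an assumption, and so is everything in the sketch below: the toy `words` table and the tiny `lexicon` are hypothetical, and the scores are illustrative rather than taken from any real lexicon.

```r
library(dplyr)

# Toy one-word-per-row data (hypothetical).
words <- tibble(
  episode_number = c(1, 1, 1, 2, 2),
  word           = c("wrong", "love", "program", "hate", "happy")
)

# Tiny hypothetical lexicon with integer scores.
lexicon <- tibble(
  word  = c("wrong", "love", "hate", "happy"),
  score = c(-2L, 3L, -3L, 3L)
)

episode_sentiment <- words %>%
  left_join(lexicon, by = "word") %>%   # words missing from the lexicon get NA
  group_by(episode_number) %>%
  summarise(mean_score = mean(score, na.rm = TRUE))  # average scored words only

episode_sentiment
```

Averaging only the scored words keeps long, mostly neutral episodes from being pulled toward zero.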


Episode popularity

plot of chunk unnamed-chunk-5

--- #naughty-episodes

Are naughty episodes more popular?

plot of chunk unnamed-chunk-6
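
One way to approach the question is to compare each episode's share of swear words with its IMDB rating. The sketch below uses a hypothetical toy table with the `episode_number`, `swear_word` and `user_rating` columns from the data. It says nothing about the real show, and the choice of a Spearman rank correlation is an assumption, not necessarily the method behind the plot.

```r
library(dplyr)

# Toy one-word-per-row data for three episodes (hypothetical values).
words <- tibble(
  episode_number = rep(1:3, each = 4),
  swear_word     = c(TRUE, FALSE, FALSE, FALSE,
                     TRUE, TRUE, FALSE, FALSE,
                     FALSE, FALSE, FALSE, FALSE),
  user_rating    = rep(c(8.1, 8.8, 7.5), each = 4)
)

episodes <- words %>%
  group_by(episode_number) %>%
  summarise(
    swear_share = mean(swear_word),   # fraction of words flagged as swear words
    user_rating = first(user_rating)  # rating is constant within an episode
  )

# Rank correlation between naughtiness and popularity.
rho <- cor(episodes$swear_share, episodes$user_rating, method = "spearman")
rho
```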

--- #mysterion

So who's the naughtiest character?


It's Kenny!


plot of chunk unnamed-chunk-7


Contact

[https://www.linkedin.com/in/patrik-drhlik/](https://www.linkedin.com/in/patrik-drhlik/)

[https://github.com/pdrhlik](https://github.com/pdrhlik)

[@PatrioScraper](https://twitter.com/PatrioScraper)

[patrik.drhlik@gmail.com](mailto:patrik.drhlik@gmail.com)

[https://www.patrio.blog](https://www.patrio.blog)