# Data Science 5K Capstone Proposal
In order to get your capstone approved, you must complete all of the following steps.

## 1) Get your data
You may use any data set(s) you like, so long as they meet these criteria:

* Your data cannot have _anything_ to do with your work at Booz Allen Hamilton.
* Your data must be publicly available for free.
* Your data should be interesting to _you_. You want your capstone to be something you're proud of.
* Your data should be "big enough":
    - It should have at least 1,000 rows.
    - It should have enough columns to be interesting.
    - If you have questions, contact a member of the instructional team.

## 2) Provide a link to your data
Your data is required to be free and open to anyone. As such, you should have a URL which anyone can use to download your data:

In [2]:
# Enter link here.

#   https://github.com/finalfire/got_ratings              -- Rotten Tomatoes ratings
#   https://www.kaggle.com/rezaghari/game-of-thrones      -- Episode data (release date, writers, director, US viewership, IMDB rating)
#   https://github.com/jeffreylancaster/game-of-thrones   -- Character and word-count data

#   For simplicity I used Excel to combine these datasets into a single .xlsm file, then converted the relevant data to .csv

## 3) Import your data
In the space below, import your data. If your data span multiple files, read them all in. If applicable, merge or append them as needed.

In [2]:
import pandas as pd

got = pd.read_csv('got.csv')
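
The sources above were combined in Excel before import. As an alternative, the merge step could be done directly in pandas. The sketch below is illustrative only: the two toy frames stand in for the episode-level and character-level files, and the choice of `Season` and `Episode` as join keys is an assumption based on the column names in `got.csv`.

```python
import pandas as pd

# Toy stand-ins for two of the source files (values are illustrative).
episodes = pd.DataFrame({
    "Season": [1, 1],
    "Episode": [1, 2],
    "Episode_Title": ["Winter Is Coming", "The Kingsroad"],
    "Rating": [9.1, 8.8],
})
words = pd.DataFrame({
    "Season": [1, 1, 1],
    "Episode": [1, 1, 2],
    "Character": ["Will", "Eddard Stark", "Jorah Mormont"],
    "Word_Count": [2, 9, 7],
})

# Attach episode-level columns to every character row.
# validate="many_to_one" raises if an episode appears twice in `episodes`.
merged = words.merge(
    episodes,
    on=["Season", "Episode"],
    how="left",
    validate="many_to_one",
)
print(merged.shape)
```

A left join keeps every character row even when an episode has no match, which makes gaps visible as missing values rather than silently dropping rows.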


## 4) Show me the head of your data.

In [3]:
got.head()

|  | Index | Seas_Ep | Season | Episode | Episode_Title | Character | Word_Count | Release_date | Duration | Rating | Writer_1 | Writer_2 | Director | US_Viewers | rate | rate_aud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | S1E1 | 1 | 1 | Winter Is Coming | Will | 2 | 4/17/2011 | 62 | 9.1 | David Benioff | D.B. Weiss | Timothy Van Patten | 2.2 | 100.0 | 96.0 |
| 1 | 1 | S1E2 | 1 | 2 | The Kingsroad | Jorah Mormont | 7 | 4/24/2011 | 56 | 8.8 | David Benioff | D.B. Weiss | Timothy Van Patten | 2.2 | 100.0 | 96.0 |
| 2 | 2 | S1E3 | 1 | 3 | Lord Snow | Royal Steward | 21 | 5/1/2011 | 58 | 8.7 | David Benioff | D.B. Weiss | Brian Kirk | 2.4 | 86.0 | 96.0 |
| 3 | 3 | S1E4 | 1 | 4 | Cripples, Bastards, and Broken Things | Old Nan | 6 | 5/8/2011 | 56 | 8.8 | David Benioff | D.B. Weiss | Brian Kirk | 2.5 | 100.0 | 96.0 |
| 4 | 4 | S1E5 | 1 | 5 | The Wolf and the Lion | Eddard Stark | 9 | 5/15/2011 | 55 | 9.1 | David Benioff | D.B. Weiss | Brian Kirk | 2.6 | 95.0 | 96.0 |


## 5) Show me the shape of your data

In [4]:
got.shape

(32923, 16)

## 6) Show me the proportion of missing observations for each column of your data

In [5]:
got.isna().sum()


Index               0
Seas_Ep             0
Season              0
Episode             0
Episode_Title       0
Character        9322
Word_Count       9322
Release_date        0
Duration            0
Rating              0
Writer_1            0
Writer_2            0
Director            0
US_Viewers          0
rate                0
rate_aud            0
dtype: int64
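
The output above reports missing-value counts, while the prompt asks for proportions. From the counts and the shape, `Character` and `Word_Count` are each missing in 9322 of 32923 rows, or roughly 28.3%. A minimal sketch of computing proportions directly, using a small toy frame rather than `got.csv` itself:

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the real data (values are illustrative).
toy = pd.DataFrame({
    "Episode": [1, 2, 3, 4],
    "Character": ["Will", None, "Old Nan", None],
    "Word_Count": [2, np.nan, 6, np.nan],
})

# isna() yields booleans; their column-wise mean is the share of missing rows.
missing_share = toy.isna().mean()
print(missing_share)
```

Applied to the real frame, the equivalent call would be `got.isna().mean()`.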

## 7) Give me a problem statement.
Below, write a problem statement. Keep in mind that your task is to tease out relationships in your data and eventually build a predictive model. Your problem statement can be vague, but you should have a goal in mind. Your problem statement should be between one sentence and one paragraph.

<div class="alert-success">  I would like to examine what drove the variability in Game of Thrones' critic scores and viewership throughout its run. I am especially interested in whether certain actors, writers, or directors were particularly well or poorly received. </div>

## 8) What is your _y_-variable?
For Part C of your capstone, you will need to build a statistical model. This means you will have to accurately predict some y-variable from some combination of x-variables. From your problem statement in step 7, what is that y-variable?

<div class="alert-success">  My y-variable will be the IMDb rating, though I will also look at Rotten Tomatoes critic and audience scores and US viewership.</div>

## Update:

After completing EDA, we determined that finding a ratings relationship was infeasible, so we switched to an NLP analysis. Our new problem statement asks whether a character's social class can be determined from their vocabulary. We will also identify wordprint similarities between characters.
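
One possible way to compare wordprints is cosine similarity between per-character word-frequency profiles. The sketch below is only an illustration of that idea, not the method used in the capstone: the speaker names and lines are invented placeholders, and the helper names `word_counts` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def word_counts(text):
    """Lowercase, split on whitespace, strip basic punctuation, count words."""
    tokens = (w.strip(".,!?;:'\"").lower() for w in text.split())
    return Counter(t for t in tokens if t)


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 if either is empty)."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented placeholder dialogue, not taken from the dataset.
lines = {
    "speaker_a": "the king rides north and the king returns",
    "speaker_b": "the king rides south",
    "speaker_c": "bread and ale for everyone",
}
profiles = {name: word_counts(text) for name, text in lines.items()}

sim_ab = cosine_similarity(profiles["speaker_a"], profiles["speaker_b"])
sim_ac = cosine_similarity(profiles["speaker_a"], profiles["speaker_c"])
print(sim_ab, sim_ac)
```

Raw counts let common function words dominate the score, so a fuller analysis would likely weight or filter them before comparing characters.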