## Project Data Wrangling - We rate dogs

This project is meant to ensure that you have mastered the subjects covered in the statistics lessons. The hope is that it covers these topics as comprehensively as possible. Good luck!

## Table of Contents
- [Gathering Data](#gather)
- [Assessing Data](#assess)
- [Cleaning Data](#clean)
- [Analyzing Data](#analyze)




In [61]:
import tweepy
import pandas as pd
import numpy as np
import requests
import os
import re
# use module dotenv to manage API keys and secrets
%load_ext dotenv
%dotenv

consumer_key = os.environ.get('CONSUMER_KEY')
consumer_secret = os.environ.get('CONSUMER_SECRET')
access_token = os.environ.get('ACCESS_TOKEN')
access_secret = os.environ.get('ACCESS_SECRET')
tsv_url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"


auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)


The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


<a id='gather'></a>
## Gathering Data
We gather data from three sources:
1. Enhanced Twitter archive: the CSV file `twitter-archive-enhanced.csv` (loaded into `df_tweets_raw`)
2. Additional Data via the Twitter API
3. Image Predictions File

Let's start by loading the data from the Twitter archive file and taking a look at a few records:

In [5]:
df_tweets_raw = pd.read_csv('twitter-archive-enhanced.csv')
df_tweets_raw.sample(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
739,780601303617732608,,,2016-09-27 02:53:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Hercules. He can have whatever he wants f...,,,,https://twitter.com/dog_rates/status/780601303...,12,10,Hercules,,,,
2056,671357843010908160,,,2015-11-30 15:59:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Tfw she says hello from the other side. 9/10 h...,,,,https://twitter.com/dog_rates/status/671357843...,9,10,,,,,
2271,667495797102141441,,,2015-11-20 00:12:54 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",This is Philippe from Soviet Russia. Commandin...,,,,https://twitter.com/dog_rates/status/667495797...,9,10,Philippe,,,,
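
Of the other two sources, the image predictions file can be fetched from the `tsv_url` defined in the setup cell. The following is a minimal sketch, assuming the file is tab-separated as its extension suggests. The helper names `download_text` and `read_tsv` and the columns of the inline sample are hypothetical and only serve as illustration.

```python
import io

import pandas as pd


def download_text(url, timeout=30):
    """Download a text resource and return its decoded body."""
    # Imported lazily so the parsing helper below also works offline.
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def read_tsv(tsv_text):
    """Parse tab-separated text into a DataFrame."""
    return pd.read_csv(io.StringIO(tsv_text), sep="\t")


# Tiny inline sample standing in for the downloaded file (made-up columns and values).
sample = "tweet_id\tp1\tp1_conf\n1\tpug\t0.9\n2\tcorgi\t0.8\n"
df_sample = read_tsv(sample)

# With network access, the real file could be loaded like this:
# df_images_raw = read_tsv(download_text(tsv_url))
```

Separating the download from the parsing keeps the parsing step testable without a network connection.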


<a id='assess'></a>
## Assessing Data

### 1.) Enhanced Twitter Archive
**Quality:**  
- `name` column: some entries appear to be regular words rather than names (found by visual inspection)
- `rating_denominator` column: several entries are not equal to 10, which suggests invalid rating values (found by visual inspection)
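
Both observations above come from visual inspection. As a rough sketch, they could also be checked programmatically. The small `df_demo` frame below is made-up illustration data, not rows from the archive, and treating all-lowercase entries as non-names is an assumption.

```python
import pandas as pd

# Made-up rows that mimic the two quality issues.
df_demo = pd.DataFrame({
    "name": ["Hercules", "a", "quite", "Philippe", "None"],
    "rating_numerator": [12, 9, 84, 9, 11],
    "rating_denominator": [10, 10, 70, 10, 10],
})

# Rows whose denominator deviates from the usual 10.
odd_denominators = df_demo[df_demo["rating_denominator"] != 10]

# Entries in the name column that are entirely lowercase (likely regular words).
lowercase_names = df_demo.loc[df_demo["name"].str.islower(), "name"]

print(odd_denominators)
print(lowercase_names.value_counts())
```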


In [6]:
df_tweets_raw.isnull().sum() / df_tweets_raw.shape[0]

tweet_id                      0.000000
in_reply_to_status_id         0.966893
in_reply_to_user_id           0.966893
timestamp                     0.000000
source                        0.000000
text                          0.000000
retweeted_status_id           0.923175
retweeted_status_user_id      0.923175
retweeted_status_timestamp    0.923175
expanded_urls                 0.025042
rating_numerator              0.000000
rating_denominator            0.000000
name                          0.000000
doggo                         0.000000
floofer                       0.000000
pupper                        0.000000
puppo                         0.000000
dtype: float64

In [15]:
df_tweets_raw['doggo'].isnull()

0       False
1       False
2       False
3       False
4       False
        ...  
2351    False
2352    False
2353    False
2354    False
2355    False
Name: doggo, Length: 2356, dtype: bool

In [18]:
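# Assign the result back; replace(..., inplace=True) on a selected column may not update the DataFrame in recent pandas versions
df_tweets_raw['doggo'] = df_tweets_raw['doggo'].replace('None', np.nan)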

In [27]:
# List the distinct non-missing values in the doggo column
df_tweets_raw[df_tweets_raw.doggo.notna()].doggo.unique()

array(['doggo'], dtype=object)
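
The missing-value fractions above report 0 for `doggo`, apparently because absent values are stored as the string `'None'` rather than as `NaN`. Assuming the other stage columns and `name` use the same placeholder (only `doggo` is checked above), the replacement could be applied to several columns at once. This is a sketch on a made-up demo frame, not the archive itself.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'None' is the string placeholder for a missing value.
df_demo = pd.DataFrame({
    "name": ["Hercules", "None", "Philippe"],
    "doggo": ["None", "doggo", "None"],
    "pupper": ["None", "None", "pupper"],
})

placeholder_cols = ["name", "doggo", "pupper"]

# Share of 'None' strings per column, which isnull() does not detect.
placeholder_share = (df_demo[placeholder_cols] == "None").mean()

# Replace the placeholder with NaN on a copy, leaving the input untouched.
df_fixed = df_demo.copy()
df_fixed[placeholder_cols] = df_fixed[placeholder_cols].replace("None", np.nan)

# After the replacement, isnull() reports the same shares.
missing_share = df_fixed.isnull().mean()
```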

<a id="clean"></a>
## Cleaning Data



In [80]:
# Make a copy to work with while cleaning the data
df_tweets_clean = df_tweets_raw.copy()

### Tidiness
#### Validity
##### Issue: Some names in the name column aren't actually names but regular words. 
##### Define
The issue is probably due to a naive assumption in the process that parsed the tweet text ("This is *dogname*"). My solution is to create a new name column and extract only valid names from the `name` column, that is, only names that start with an uppercase letter and have more than one character.
##### Code

In [72]:
# Keep only names that start with an uppercase letter followed by at least one more word character
df_tweets_clean['name_extract'] = df_tweets_clean.name.str.extract(r"^([A-Z]\w+)", expand=False)

##### Test

In [89]:
# Check whether only regular words are left in the original name column
df_tweets_clean[df_tweets_clean.name_extract != df_tweets_clean.name]['name'].value_counts()

a               55
the              8
an               7
very             5
quite            4
one              4
just             4
not              2
mad              2
getting          2
actually         2
O                1
his              1
incredibly       1
this             1
by               1
all              1
light            1
old              1
life             1
space            1
infuriating      1
such             1
my               1
unacceptable     1
officially       1
Name: name, dtype: int64

| | name_extract | name | text |
|---|---|---|---|
| 22 | NaN | such | I've yet to rate a Venezuelan Hover Wiener. Th... |
| 56 | NaN | a | Here is a pupper approaching maximum borkdrive... |
| 118 | NaN | quite | RT @dog_rates: We only rate dogs. This is quit... |
| 169 | NaN | quite | We only rate dogs. This is quite clearly a smo... |
| 193 | NaN | quite | Guys, we only rate dogs. This is quite clearly... |
| ... | ... | ... | ... |
| 2349 | NaN | an | This is an odd dog. Hard on the outside but lo... |
| 2350 | NaN | a | This is a truly beautiful English Wilson Staff... |
| 2352 | NaN | a | This is a purebred Piers Morgan. Loves to Netf... |
| 2353 | NaN | a | Here is a very happy pup. Big fan of well-main... |
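
One caveat of the pattern `^([A-Z]\w+)` is that a placeholder string such as `'None'` also starts with an uppercase letter and therefore survives the extraction. The sketch below shows this on a few made-up names. Replacing the placeholder with `NaN` first is one possible way to handle it, not necessarily the approach taken in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample: a real name, two regular words, a single letter, and the placeholder.
names = pd.Series(["Hercules", "a", "O", "None", "quite"], name="name")

# The pattern keeps capitalized entries with at least two characters.
extracted = names.str.extract(r"^([A-Z]\w+)", expand=False)

# 'None' matches the pattern, so replace it with NaN before extracting.
cleaned = names.replace("None", np.nan).str.extract(r"^([A-Z]\w+)", expand=False)

print(pd.DataFrame({"name": names, "extracted": extracted, "cleaned": cleaned}))
```

Passing `expand=False` makes `str.extract` return a Series for a single capture group, which can be assigned directly to a new column.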
