## Gathering

Import the packages needed for loading and gathering the data.

In [9]:
import pandas as pd
import requests
import os

The first source is `twitter-archive-enhanced.csv`. It holds the bulk of the data about tweets from the WeRateDogs account between 2015 and 2017.

In [3]:
archive_df = pd.read_csv('twitter-archive-enhanced.csv')

In [5]:
archive_df.sample(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2051,671488513339211776,,,2015-12-01 00:38:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Julius. He's a cool dog. Carries seash...,,,,https://twitter.com/dog_rates/status/671488513...,8,10,Julius,,,,
699,786286427768250368,,,2016-10-12 19:24:27 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",This is Arnie. He's afraid of his own bark. 12...,,,,https://vine.co/v/5XH0WqHwiFp,12,10,Arnie,,,,
1815,676613908052996102,,,2015-12-15 04:05:01 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is the saddest/sweetest/best picture I've...,,,,https://twitter.com/dog_rates/status/676613908...,12,10,the,,,,


The second source is a file that had to be downloaded programmatically from the Udacity servers. It contains the results of a neural network that classified the images from the WeRateDogs account. I downloaded this file with the Python library `requests`.

In [17]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

folder_name = 'image-predictions'
os.makedirs(folder_name, exist_ok=True)

# Save the file under its original name, taken from the end of the URL
file_path = os.path.join(folder_name, url.split('/')[-1])
with open(file_path, mode='wb') as file:
    file.write(response.content)

img_predictions_df = pd.read_csv(file_path, sep='\t')
img_predictions_df.sample(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
803,691483041324204033,https://pbs.twimg.com/media/CZikKBIWYAA40Az.jpg,1,bloodhound,0.886232,True,black-and-tan_coonhound,0.07742,True,Gordon_setter,0.009826,True
874,698178924120031232,https://pbs.twimg.com/media/CbBuBhbWwAEGH29.jpg,1,Chesapeake_Bay_retriever,0.351868,True,malinois,0.207753,True,Labrador_retriever,0.154606,True
266,670804601705242624,https://pbs.twimg.com/media/CU8tOJZWUAAlNoF.jpg,1,Pomeranian,0.86856,True,Pekinese,0.090129,True,chow,0.021722,True


The third source was the Twitter API, queried through Tweepy with the tweet IDs found in the archive file. Tweepy is an easy-to-use Python library that connects to a Twitter account using secret and public keys. Once authenticated, one can easily retrieve tweets from Twitter.
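A minimal sketch of how this step could look, assuming Tweepy is installed and the four credentials are available. The helper names `make_api` and `fetch_counts`, the credential arguments, and the choice to skip tweets that can no longer be fetched are illustrative assumptions, not the code used in this notebook.

```python
import pandas as pd


def make_api(consumer_key, consumer_secret, access_token, access_secret):
    """Build an authenticated Tweepy client (credentials are placeholders)."""
    import tweepy  # imported lazily so the rest of the sketch runs without it

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    # wait_on_rate_limit makes Tweepy sleep instead of failing on rate limits
    return tweepy.API(auth, wait_on_rate_limit=True)


def fetch_counts(api, tweet_ids):
    """Return a dataframe of retweet and favorite counts, plus the IDs that failed."""
    rows, failed = [], []
    for tweet_id in tweet_ids:
        try:
            status = api.get_status(tweet_id, tweet_mode='extended')
        except Exception:  # assumption: any error means the tweet is unavailable
            failed.append(tweet_id)
            continue
        rows.append({'tweet_id': tweet_id,
                     'retweet_count': status.retweet_count,
                     'favorite_count': status.favorite_count})
    columns = ['tweet_id', 'retweet_count', 'favorite_count']
    return pd.DataFrame(rows, columns=columns), failed
```

With real credentials, something like `tweets_df, failed_ids = fetch_counts(make_api(...), archive_df.tweet_id)` would produce the counts table, and `failed_ids` would list the tweets that could not be retrieved.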

## Assessing

Three dataframes:

- `tweets_df` which has retweet and favorite counts
- `img_predictions_df` has the results of a neural network trying to identify dog breed in a tweet's picture
- `archive_df` has the tweet's text, rating, and dog category

Because the data came from three different sources, inconsistencies between the three files were to be expected. The task at hand is to find and clean at least eight data quality issues and two tidiness issues.

In [18]:
img_predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


- `tweet_id` should be a string
- column names are unclear: `tweet_id`, `jpg_url`, and `img_num` hold the tweet ID, the image URL, and the number of the image that corresponded to the most confident prediction (numbered 1 to 4, since tweets can have up to four images)
- the remaining columns are not about the images themselves (tidiness issue)

In [20]:
archive_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

- `tweet_id` should be a string
- `source` has a quality issue
- `timestamp` should be a date type; it is also a tidiness issue, since date and time should be separate columns
- some column names are unclear
- `doggo`, `floofer`, `pupper`, `puppo` are a tidiness issue
- `name` should not contain `@`
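
The two type issues noted above could be addressed along these lines. This is only a sketch on a small stand-in dataframe; the `sample` name and the new `date` and `time` columns are assumptions made for illustration.

```python
import pandas as pd

# Tiny stand-in for archive_df with the same column names and formats
sample = pd.DataFrame({
    'tweet_id': [671488513339211776, 786286427768250368],
    'timestamp': ['2015-12-01 00:38:31 +0000', '2016-10-12 19:24:27 +0000'],
})

# IDs are labels, not quantities, so store them as strings
sample['tweet_id'] = sample['tweet_id'].astype(str)

# Parse the timestamp text into timezone-aware datetimes
sample['timestamp'] = pd.to_datetime(sample['timestamp'])

# Split into separate date and time columns
sample['date'] = sample['timestamp'].dt.date
sample['time'] = sample['timestamp'].dt.time
```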

In [22]:
sum(archive_df.duplicated())

0

In [23]:
sum(img_predictions_df.duplicated())

0

In [26]:
archive_df.name.value_counts().head(5)

None       745
a           55
Charlie     12
Oliver      11
Cooper      11
Name: name, dtype: int64
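
The counts above show that `None` and `a` are the most frequent values, and neither is a real name. One possible way to flag such entries is sketched below on a hypothetical sample. The rule that real names are capitalized is an assumption, not something the dataset guarantees.

```python
import pandas as pd

# Hypothetical sample of the name column
names = pd.Series(['Charlie', 'a', 'None', 'the', 'Oliver', 'a'])

# Assumption: real names are capitalized, so lowercase entries and the
# literal string 'None' are treated as missing names
is_invalid = names.str.islower() | (names == 'None')

invalid_counts = names[is_invalid].value_counts()
cleaned = names.mask(is_invalid)  # invalid entries become NaN
```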

### Tidiness issues

`tweets_df` table

- retweets and favorites in their own table

`archive_df` table

- dog stages in multiple columns
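
Both tidiness issues could be handled roughly as follows. This is a sketch on hypothetical rows: it assumes each stage column holds the stage name when it applies and `'None'` otherwise, and the `dog_stage` column name and the `combine_stages` helper are made up for illustration.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Hypothetical rows; assumes a cell holds the stage name when it applies
# and the string 'None' otherwise
sample = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['pupper', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None'],
})


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [col for col in stage_cols if row[col] == col]
    return ', '.join(found) if found else None


# Collapse the four stage columns into one
sample['dog_stage'] = sample[stage_cols].apply(combine_stages, axis=1)
sample = sample.drop(columns=stage_cols)

# Hypothetical counts table joined onto the archive by tweet_id
counts = pd.DataFrame({'tweet_id': ['1', '2'],
                       'retweet_count': [5, 7],
                       'favorite_count': [50, 70]})
merged = sample.merge(counts, on='tweet_id', how='left')
```

A left merge keeps every archive row, so tweets without counts end up with missing values rather than being dropped.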


## Cleaning

In [None]:
# Make copies of the original dataframes so the raw data stays untouched
df_clean = archive_df.copy()
images_clean = img_predictions_df.copy()
tweet_df_clean = tweets_df.copy()