## Gathering

Import the packages used for loading, gathering, and visualizing the data.

In [3]:
import pandas as pd
import requests
import os

The first source is `twitter-archive-enhanced.csv`, which holds the bulk of the data about tweets from the WeRateDogs account from 2015 to 2017.

In [4]:
archive_df = pd.read_csv('twitter-archive-enhanced.csv')

In [5]:
archive_df.sample(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1263,710117014656950272,,,2016-03-16 14:54:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This pupper got her hair chalked for her birth...,,,,https://twitter.com/dog_rates/status/710117014...,11,10,,,,pupper,
139,865359393868664832,,,2017-05-19 00:12:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Sammy. Her tongue ejects without warni...,,,,https://twitter.com/dog_rates/status/865359393...,13,10,Sammy,,,,
1820,676588346097852417,,,2015-12-15 02:23:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bubbles. He kinda resembles a fish. Al...,,,,https://twitter.com/dog_rates/status/676588346...,5,10,Bubbles,,,,


The second source is a file to be downloaded programmatically from the Udacity servers. It holds the results of a neural network run on the images from the WeRateDogs account. I downloaded this file using the Python library `requests`.

In [6]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Save the file under image-predictions/, creating the folder if needed
folder_name = 'image-predictions'
os.makedirs(folder_name, exist_ok=True)

file_path = os.path.join(folder_name, url.split('/')[-1])
with open(file_path, mode='wb') as file:
    file.write(response.content)

img_predictions_df = pd.read_csv(file_path, sep='\t')
img_predictions_df.sample(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1376,763183847194451968,https://pbs.twimg.com/media/CpdfpzKWYAAWSUi.jpg,1,miniature_poodle,0.354674,True,toy_poodle,0.338642,True,teddy,0.155828,False
951,704871453724954624,https://pbs.twimg.com/media/Ccg02LiWEAAJHw1.jpg,1,Norfolk_terrier,0.689504,True,soft-coated_wheaten_terrier,0.10148,True,Norwich_terrier,0.055779,True
2000,876120275196170240,https://pbs.twimg.com/media/DCiavj_UwAAcXep.jpg,1,Bernese_mountain_dog,0.534327,True,Saint_Bernard,0.346312,True,Greater_Swiss_Mountain_dog,0.094933,True


The third source is the Twitter API, queried through Tweepy using the tweet IDs found in the archive file. Tweepy is an easy-to-use Python library that connects to a Twitter account using secret and public keys. Once authenticated, one can easily retrieve tweets from Twitter.
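A minimal sketch of how the retrieved tweet data might be turned into `tweets_df`, assuming each tweet's JSON has already been fetched and saved one object per line. The helper `load_tweet_counts` and the sample records are hypothetical. `id`, `retweet_count`, and `favorite_count` are assumed to be the relevant fields of the tweet JSON, and the column names `retweets` and `favorites` follow the ones used later in the cleaning code.

```python
import json
import pandas as pd

def load_tweet_counts(lines):
    """Build a dataframe of retweet and favorite counts from JSON lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({'tweet_id': tweet['id'],
                     'retweets': tweet['retweet_count'],
                     'favorites': tweet['favorite_count']})
    return pd.DataFrame(rows, columns=['tweet_id', 'retweets', 'favorites'])

# Two made-up records standing in for the saved tweet JSON
sample_lines = [
    '{"id": 1001, "retweet_count": 5, "favorite_count": 20}',
    '{"id": 1002, "retweet_count": 0, "favorite_count": 3}',
]
tweets_df = load_tweet_counts(sample_lines)
```

With a real file, the same helper could be given the open file object instead of `sample_lines`, since it only iterates over lines.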

## Assessing

Three dataframes:

- `tweets_df` which has retweet and favorite counts
- `img_predictions_df` has the results of a neural network trying to identify dog breed in a tweet's picture
- `archive_df` has the tweet's text, rating, and dog category

Because there were three different data sources, inconsistencies between the three files are to be expected. The task at hand is to find and clean at least eight data quality issues and two tidiness issues.

In [7]:
img_predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


- `tweet_id` should be a string.
- The first three columns hold the tweet ID, the image URL, and the number of the image that corresponded to the most confident prediction (numbered 1 to 4, since tweets can have up to four images).
- The remaining columns describe the predictions rather than the images, which is a tidiness issue.

In [8]:
archive_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [9]:
sum(archive_df.duplicated())

0

In [10]:
sum(img_predictions_df.duplicated())

0
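`duplicated()` without arguments only flags rows that are identical in every column. A minimal sketch, on made-up data, of also checking the key column on its own. The toy frame below is illustrative and not taken from either file.

```python
import pandas as pd

toy = pd.DataFrame({'tweet_id': [1, 2, 2],
                    'jpg_url': ['a.jpg', 'b.jpg', 'c.jpg']})

# Rows identical in every column
full_row_dupes = toy.duplicated().sum()

# Rows that repeat a tweet_id already seen earlier in the frame
key_dupes = toy.duplicated(subset='tweet_id').sum()
```

Here no row is a full duplicate, yet one `tweet_id` is repeated, which the second check picks up.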

In [11]:
archive_df.name.value_counts().head(5)

None       745
a           55
Charlie     12
Cooper      11
Lucy        11
Name: name, dtype: int64

## Issues

### Tidiness issues

Tidiness: issues with structure. Untidy data is also known as messy data.
Tidy data requirements:

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.

`tweets_df` table
- retweet and favorite counts are kept in their own table, separate from the tweets they describe
- only the columns important for the analysis should be kept

`archive_df` table
- dog stages are spread over multiple columns


### Quality issues

Quality: issues with content. Low quality data is also known as dirty data.

`archive_df` table
- tweet_id is int
- timestamp is str
- in_reply_to_status_id is float
- in_reply_to_user_id is float
- retweeted_status_id is float
- retweeted_status_user_id is float
- retweeted_status_timestamp is str
- dog stages are str
- text is cut off with ellipses
- incorrect dog names (a, an, the, just, one, very, quite, not, actually, mad, space, infuriating, all, officially, 0, old, life, unacceptable, my, incredibly, by, his, such)
- source is wrapped in an HTML anchor tag; the actual tweet source (iPhone or other) needs to be extracted
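For the `source` item above, a minimal sketch of pulling the readable label out of an anchor tag with `Series.str.extract`. The two strings are made up for illustration, because the notebook output only shows truncated `source` values, so the real labels may differ.

```python
import pandas as pd

# Hypothetical values shaped like an HTML anchor tag
source = pd.Series([
    '<a href="http://example.com/iphone" rel="nofollow">Example for iPhone</a>',
    '<a href="http://example.com/client">Some Other Client</a>',
])

# Capture the text between the opening tag and the closing </a>
source_label = source.str.extract(r'>([^<]+)</a>', expand=False)
```

`expand=False` returns a Series rather than a one-column dataframe, so the result can be assigned straight back to a column.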

`img_predictions_df` table
- tweet_id is int in img_predictions_df


## Cleaning

In [15]:
# Make a copy of each original dataframe before cleaning
archive_clean = archive_df.copy()
img_predictions_clean = img_predictions_df.copy()
# tweets_clean = tweets_df.copy()

### Tidiness
Best practice: start by solving the tidiness issues.

#### Issue ( 1 )
##### Define
Retweet and favorite counts sit in their own table (`tweets_df`), and the image predictions sit in `img_predictions_df`, although both describe the same tweets as `archive_df`.

Merge the `tweets_df` and `img_predictions_df` tables into the `archive_df` table with an inner join on *tweet_id*.

##### Code

In [None]:
# Inner joins keep only tweets present in every table
archive_clean = pd.merge(archive_clean, tweets_clean,
                         how='inner', on='tweet_id')
archive_clean = pd.merge(archive_clean, img_predictions_clean,
                         how='inner', on='tweet_id')
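An inner join keeps only the tweets that appear in all three tables, so the merged table can end up with fewer rows than the archive. A minimal sketch on toy frames; the names `archive`, `counts`, and `images` and their contents are made up for illustration.

```python
import pandas as pd

archive = pd.DataFrame({'tweet_id': [1, 2, 3],
                        'name': ['Sammy', 'Bubbles', 'None']})
counts = pd.DataFrame({'tweet_id': [1, 2],
                       'retweets': [10, 4],
                       'favorites': [50, 9]})
images = pd.DataFrame({'tweet_id': [2, 3],
                       'p1': ['pug', 'teddy']})

# Chain two inner joins on the shared key
merged = (archive
          .merge(counts, how='inner', on='tweet_id')
          .merge(images, how='inner', on='tweet_id'))
```

Only tweet 2 is present in all three toy tables, so it is the only row left after both joins.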

##### Test

In [20]:
archive_clean.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,NaT,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,NaT,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,NaT,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,NaT,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,NaT,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


#### Issue ( 2 )
##### Define
Melt the dog stage columns (doggo, floofer, pupper, and puppo) in the `archive_df` table into a single `dog_stage` column.


##### Code

In [None]:
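# Melt the four stage columns into a single dog_stage column
archive_clean = pd.melt(archive_clean,
                        id_vars=['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id',
                                 'timestamp', 'source', 'text', 'retweeted_status_id',
                                 'retweeted_status_user_id', 'retweeted_status_timestamp',
                                 'expanded_urls', 'rating_numerator', 'rating_denominator',
                                 'name', 'retweets', 'favorites', 'jpg_url', 'img_num', 'p1',
                                 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf',
                                 'p3_dog'],
                        value_vars=['doggo', 'floofer', 'pupper', 'puppo'],
                        value_name='dog_stage')
archive_clean = archive_clean.drop('variable', axis=1)

# Melting yields one row per tweet per stage column, so keep one row per tweet.
# Assumes missing stages are the string 'None', which sorts before the lowercase
# stage names; keep='last' then prefers a real stage (tweets with two stages keep one).
archive_clean = (archive_clean.sort_values('dog_stage')
                              .drop_duplicates(subset='tweet_id', keep='last')
                              .reset_index(drop=True))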

##### Test

In [None]:
archive_clean.head()

#### Issue ( 3 )
##### Define

##### Code

##### Test

### Quality

#### Issue ( 1 )
##### Define

Convert data types:
- `tweet_id`, `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, and `retweeted_status_user_id` columns to string
- `dog_stage` to categorical type
- `timestamp` and `retweeted_status_timestamp` to datetime

##### Code

In [16]:
archive_clean.tweet_id = archive_clean.tweet_id.astype(str)
archive_clean.in_reply_to_status_id = archive_clean.in_reply_to_status_id.astype(str)
archive_clean.in_reply_to_user_id = archive_clean.in_reply_to_user_id.astype(str)
archive_clean.retweeted_status_id = archive_clean.retweeted_status_id.astype(str)
archive_clean.retweeted_status_user_id = archive_clean.retweeted_status_user_id.astype(str)

archive_clean.dog_stage = archive_clean.dog_stage.astype('category')

archive_clean.timestamp = pd.to_datetime(archive_clean.timestamp)
archive_clean.retweeted_status_timestamp = pd.to_datetime(archive_clean.retweeted_status_timestamp)
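One caveat with `astype(str)` on the float ID columns: large floats are rendered in exponent notation and missing values become the string `'nan'`. A possible alternative, sketched here under the assumption that the file can be re-read, is to ask `read_csv` for strings up front so the IDs never pass through a float. The inline CSV is a stand-in for the archive file and its second row is made up.

```python
import io
import pandas as pd

# Stand-in for the archive CSV; only two ID columns are shown
csv_text = (
    'tweet_id,in_reply_to_status_id\n'
    '892420643555336193,\n'
    '886267009285017600,886266357075128320\n'
)

id_columns = ['tweet_id', 'in_reply_to_status_id']

# Read the ID columns as strings; empty cells stay missing (NaN)
ids = pd.read_csv(io.StringIO(csv_text),
                  dtype={col: str for col in id_columns})
```

Every digit of the ID is preserved this way, and missing replies remain proper missing values rather than text.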


##### Test

In [17]:
archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null datetime64[ns, UTC]
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null datetime64[ns, UTC]
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes

#### Issue ( 2 )
##### Define


In [18]:
archive_clean['dog_stage'].unique()

##### Code

##### Test

#### Issue ( 3 )
##### Define
Some names were incorrectly extracted from the tweet text (for example `a`, `an`, and `the`) and need to be replaced.

##### Code
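A minimal sketch of one way to handle this, assuming that real names are capitalised and mis-parsed ones are all lowercase, which holds for most entries in the issue list. The sample series is made up, and using the string `'None'` as the placeholder follows what the `name` value counts show for missing names.

```python
import pandas as pd

names = pd.Series(['Sammy', 'a', 'Bubbles', 'quite', 'None', 'the'])

# Flag values made up entirely of lowercase letters
not_a_name = names.str.islower()

# Replace the flagged values with the placeholder used for missing names
names_clean = names.mask(not_a_name, 'None')
```

The entry `0` from the issue list is not lowercase text, so it would not be caught by this check and would need separate handling.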

##### Test

#### Issue ( 4 )
##### Define

##### Code

##### Test

## Store

In [None]:
archive_clean.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)

## Analysis