# Project: Wrangle and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them into the notebook. **Note:** each piece of data is gathered with a different method.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [543]:
# import packages
import pandas as pd
import numpy as np
import requests as re
import os
import tweepy as tp
import json

In [496]:
# read WeRateDogs Twitter archive file on hand
twitter_archive = pd.read_csv(r'C:\Users\DE\Documents\Data wrangling_Udacity\twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image predictions (image-predictions.tsv)

In [497]:
# download the image predictions tsv into the current folder if it is not there yet
path = os.getcwd() # get the current path
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
file_path = os.path.join(path, url.split('/')[-1])
if not os.path.exists(file_path):
    response = re.get(url) # `re` is the alias given to requests in the import cell
    with open(file_path, mode = 'wb') as file:
        file.write(response.content)

# read image prediction tsv file into dataframe
image_prediction = pd.read_csv(file_path, sep = '\t')
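The "download only when the file is missing" step can be isolated in a small helper. The sketch below is only an illustration: `download_if_missing`, its `fetch` parameter, `fake_fetch` and the `example.com` URL are hypothetical names introduced here so the code can run offline. With Requests, `fetch` could be `lambda u: requests.get(u).content`.

```python
import os
import tempfile

def download_if_missing(url, dest_dir, fetch):
    """Save the body of `url` into `dest_dir` unless the file already exists.

    `fetch` is any callable that maps a URL to bytes (an assumption made
    so that the sketch does not need network access).
    """
    file_path = os.path.join(dest_dir, url.split('/')[-1])
    if not os.path.exists(file_path):
        with open(file_path, mode='wb') as file:
            file.write(fetch(url))
    return file_path

# offline demonstration with a stand-in fetcher and a temporary folder
calls = []

def fake_fetch(u):
    calls.append(u)  # record each simulated download
    return b'tweet_id\tjpg_url\n'

demo_dir = tempfile.mkdtemp()
demo_url = 'https://example.com/files/image-predictions.tsv'  # placeholder URL
first = download_if_missing(demo_url, demo_dir, fake_fetch)
second = download_if_missing(demo_url, demo_dir, fake_fetch)
print(first == second, len(calls))
```

The second call finds the file on disk and returns the same path without fetching again.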

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [498]:
# query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# keys are omitted from the submission to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = tp.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tp.API(auth, wait_on_rate_limit=True)

In [499]:
# skip gathering again if the file already exists
if not os.path.exists('tweet_json.txt'):
    # collect the tweet ids from the archive to query the Twitter API for additional information
    list_id = twitter_archive.tweet_id.values

    # store each tweet's entire set of JSON data, one tweet per line, in tweet_json.txt
    fails_dict = {} # tweet_id -> error, for tweets that could not be retrieved
    with open('tweet_json.txt', 'w', encoding = 'utf-8') as f: 
        for tweet_id in list_id: 
            try: 
                tweet = api.get_status(tweet_id, tweet_mode='extended')
                json.dump(tweet._json, f)
                f.write('\n')
            except tp.errors.TweepyException as e: 
                fails_dict[tweet_id] = e

In [500]:
# read tweet_json.txt line by line into dataframe 
# with tweet ID, retweet count, favorite count
df_list = []
with open('tweet_json.txt', 'r', encoding = 'utf-8') as f:
    for line in f: 
        tweet = json.loads(line) # parse each line once
        df_list.append({'tweet_id': tweet['id'], 
                        'retweet_count': tweet['retweet_count'], 
                        'favorite_count': tweet['favorite_count']})

tweet_api = pd.DataFrame(df_list, columns = ['tweet_id', 'retweet_count', 'favorite_count'])
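To show the line-by-line JSON parsing without the real `tweet_json.txt`, here is a minimal sketch that reads from an in-memory file. The helper name `parse_tweet_lines` and the sample values are made up for illustration; only the three keys (`id`, `retweet_count`, `favorite_count`) come from the cell above.

```python
import io
import json

def parse_tweet_lines(lines):
    """Turn line-delimited tweet JSON into a list of small dicts."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        tweet = json.loads(line)
        rows.append({'tweet_id': tweet['id'],
                     'retweet_count': tweet['retweet_count'],
                     'favorite_count': tweet['favorite_count']})
    return rows

# made-up sample standing in for tweet_json.txt
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 9, "full_text": "a"}\n'
    '\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3, "full_text": "b"}\n'
)
rows = parse_tweet_lines(sample)
print(rows)
```

The resulting list can be passed to `pd.DataFrame` in the same way as `df_list`.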

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
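The first key point (original ratings with images, no retweets) could be checked programmatically along these lines. This is a sketch under assumptions: the helper `keep_original_with_images` and the tiny frames are hypothetical, and it treats a row as a retweet when `retweeted_status_id` is filled and as having an image when its `tweet_id` appears in the image predictions.

```python
import pandas as pd

def keep_original_with_images(archive, predictions):
    """Keep rows that are not retweets and that have an image prediction."""
    is_original = archive['retweeted_status_id'].isna()
    has_image = archive['tweet_id'].isin(predictions['tweet_id'])
    return archive.loc[is_original & has_image]

# tiny made-up frames that only mimic the relevant columns
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [None, 8.0, None],  # tweet 2 is a retweet
})
predictions = pd.DataFrame({'tweet_id': [1, 2]})  # tweet 3 has no image

originals = keep_original_with_images(archive, predictions)
print(originals['tweet_id'].tolist())
```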



### Quality Issues

1. The timestamp column is stored as a string; it should be datetime so that tweets beyond August 1st, 2017 can be excluded


In [501]:
# check datatype of timestamp column
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [502]:
# check data format
twitter_archive.timestamp

0       2017-08-01 16:23:56 +0000
1       2017-08-01 00:17:27 +0000
2       2017-07-31 00:18:03 +0000
3       2017-07-30 15:58:51 +0000
4       2017-07-29 16:00:24 +0000
                  ...            
2351    2015-11-16 00:24:50 +0000
2352    2015-11-16 00:04:52 +0000
2353    2015-11-15 23:21:54 +0000
2354    2015-11-15 23:05:30 +0000
2355    2015-11-15 22:32:08 +0000
Name: timestamp, Length: 2356, dtype: object

2. Rating denominator is equal to 0

In [503]:
twitter_archive[twitter_archive.rating_denominator == 0]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259576.0,2017-02-24 21:54:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,,,960,0,,,,,


3. Column expanded_urls can hold more than one link per tweet: some links are duplicated and some are not Twitter links

In [504]:
# randomly check the data format in expanded_urls
twitter_archive.expanded_urls.sample(10)

2171    https://twitter.com/dog_rates/status/669328503...
2088    https://twitter.com/dog_rates/status/670792680...
1277    https://twitter.com/dog_rates/status/709042156...
1954    https://twitter.com/dog_rates/status/673656262...
833     https://twitter.com/dog_rates/status/739979191...
1360    https://twitter.com/dog_rates/status/703268521...
1099    https://twitter.com/dog_rates/status/735991953...
1750                        https://vine.co/v/iKIwAzEatd6
1902    https://twitter.com/dog_rates/status/674644256...
2150    https://twitter.com/dog_rates/status/669683899...
Name: expanded_urls, dtype: object

In [505]:
# check a single row to see the full expanded_urls value
twitter_archive.at[98, 'expanded_urls']

'https://www.gofundme.com/help-my-baby-sierra-get-better,https://twitter.com/dog_rates/status/873213775632977920/photo/1,https://twitter.com/dog_rates/status/873213775632977920/photo/1'

4. Column name contains wrong names

In [506]:
# some values are not real names, e.g. a, the, ...
twitter_archive.name.sample(10)

1535       None
472       Moose
1801       None
1191    Barclay
43         Yogi
700      Mattie
1067      Baloo
1821     Vinnie
2109       None
1302     Harper
Name: name, dtype: object

5. Tweet_ID = 740373189193256964: the dog's rating should be 14/10

In [507]:
# check data based on url by visual assessment to see the rating
twitter_archive.at[twitter_archive[twitter_archive.tweet_id == 740373189193256964].index.values[0], 'expanded_urls']

'https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1'

6. Tweet_ID = 722974582966214656: the dog's rating should be 13/10

In [508]:
# check data based on url by visual assessment to see the rating
twitter_archive.at[twitter_archive[twitter_archive.tweet_id == 722974582966214656].index.values[0], 'expanded_urls']

'https://twitter.com/dog_rates/status/722974582966214656/photo/1'

7. Tweet_ID = 666287406224695296: the dog's rating should be 9/10

In [509]:
# check data based on url by visual assessment to see the rating
twitter_archive.at[twitter_archive[twitter_archive.tweet_id == 666287406224695296].index.values[0], 'expanded_urls']

'https://twitter.com/dog_rates/status/666287406224695296/photo/1'

8. Tweet_ID = 775096608509886464 does not exist; it is a duplicate of Tweet_ID = 740373189193256964

In [510]:
# the id in the url is 740373189193256964 not 775096608509886464
twitter_archive.at[twitter_archive[twitter_archive.tweet_id == 775096608509886464].index.values[0], 'expanded_urls']

'https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1'
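The same comparison between a row's `tweet_id` and the status id inside its URL can be done programmatically. The sketch below uses hypothetical helpers (`url_status_ids`, `points_elsewhere`) and a regular expression that assumes the `twitter.com/<user>/status/<id>` URL layout seen in the outputs above. Rows flagged this way may simply be retweets, so they still deserve a visual check.

```python
import re

# assumes the URL layout twitter.com/<user>/status/<id> seen in the archive
STATUS_ID = re.compile(r'twitter\.com/\w+/status/(\d+)')

def url_status_ids(expanded_urls):
    """Return the set of status ids found in a comma-separated URL string."""
    return {int(found) for found in STATUS_ID.findall(expanded_urls)}

def points_elsewhere(tweet_id, expanded_urls):
    """True when the URLs contain status ids but none equals tweet_id."""
    ids = url_status_ids(expanded_urls)
    return bool(ids) and tweet_id not in ids

duplicate_urls = ('https://twitter.com/dog_rates/status/740373189193256964/photo/1,'
                  'https://twitter.com/dog_rates/status/740373189193256964/photo/1')
own_url = 'https://twitter.com/dog_rates/status/722974582966214656/photo/1'

print(points_elsewhere(775096608509886464, duplicate_urls))
print(points_elsewhere(722974582966214656, own_url))
```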

9. Remove data in twitter-archive-enhanced.csv that cannot be matched with image-predictions.tsv: keep only tweets not beyond August 1st, 2017

In [511]:
twitter_archive.loc[twitter_archive.timestamp > '2017-08-01']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,


### Tidiness Issues

1. Unpivot doggo, floofer, pupper, puppo into 1 column

2. Merge the image prediction columns from image-predictions.tsv (to identify which tweet_ids show dogs) and the retweet count and favorite count columns from tweet_api into the archive for analysis

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [512]:
# Make copies of original pieces of data
twitter_archive_cp = twitter_archive.copy()
image_prediction_cp = image_prediction.copy()
tweet_api_cp = tweet_api.copy()

### Quality issue

#### Issue #1:

##### Define: Change the datatype of timestamp from string to datetime to exclude tweets beyond August 1st, 2017

##### Code

In [513]:
# change datatype
twitter_archive_cp.timestamp = pd.to_datetime(twitter_archive_cp.timestamp)

# exclude data with timestamp beyond August 1st, 2017
# note: the string is read as midnight, so tweets posted later on August 1st are excluded too
twitter_archive_cp = twitter_archive_cp.loc[twitter_archive_cp.timestamp <= '2017-08-01']


##### Test

In [514]:
# check datatype
twitter_archive_cp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2354 entries, 2 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2354 non-null   int64              
 1   in_reply_to_status_id       78 non-null     float64            
 2   in_reply_to_user_id         78 non-null     float64            
 3   timestamp                   2354 non-null   datetime64[ns, UTC]
 4   source                      2354 non-null   object             
 5   text                        2354 non-null   object             
 6   retweeted_status_id         181 non-null    float64            
 7   retweeted_status_user_id    181 non-null    float64            
 8   retweeted_status_timestamp  181 non-null    object             
 9   expanded_urls               2295 non-null   object             
 10  rating_numerator            2354 non-null   int64           

In [515]:
# confirm that no timestamp is beyond August 1st, 2017
(twitter_archive_cp.timestamp > '2017-08-01').sum()

0
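Whether tweets posted during August 1st itself are kept depends on how the cutoff is written. The sketch below, which reuses three timestamps from the assessment output, contrasts a midnight cutoff with a "before August 2nd" cutoff. The variable names are illustrative, and which cutoff is appropriate is a judgment call for the analyst.

```python
import pandas as pd

stamps = pd.Series([
    '2017-08-01 16:23:56 +0000',
    '2017-08-01 00:17:27 +0000',
    '2017-07-31 00:18:03 +0000',
])
parsed = pd.to_datetime(stamps, utc=True)

# cutoff at midnight: drops everything posted during August 1st
midnight = pd.Timestamp('2017-08-01', tz='UTC')
kept_midnight = parsed[parsed <= midnight]

# cutoff before August 2nd: keeps the whole of August 1st
next_day = pd.Timestamp('2017-08-02', tz='UTC')
kept_whole_day = parsed[parsed < next_day]

print(len(kept_midnight), len(kept_whole_day))
```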

#### Issue #2:

##### Define: Rating denominator is equal to 0
- The rating denominator should never be 0; any such record needs to be excluded. 

##### Code

In [516]:
# exclude records whose denominator is 0
twitter_archive_cp = twitter_archive_cp[twitter_archive_cp.rating_denominator != 0]

##### Test

In [517]:
# test the result
(twitter_archive_cp.rating_denominator == 0).sum()

0

#### Issue #3: 

##### Define: Column expanded_urls can hold more than one link per tweet: duplicated links or links that are not Twitter links
- Some tweet ids have duplicated urls 
- Some tweet ids have links other than the Twitter link


##### Code

In [518]:
import re as regex # alias so the earlier `requests as re` import is not shadowed

In [519]:
# create a series with the number of unique urls per tweet_id
arr = twitter_archive_cp['expanded_urls'].apply(lambda x: len(set(x.split(','))) if str(x) != 'nan' else x)
# check the rows having more than 1 (or 2) unique urls
# note: .sum() adds up the url counts of those rows; (arr > 1).sum() would count the rows instead
print('number of tweet_id having more than 1 url', arr[(arr > 1)].sum()) 
print('number of tweet_id having more than 2 urls', arr[(arr > 2)].sum())

# tweet_id having only duplicated urls: keep the single unique url
getOnelink = lambda x: list(set(x.split(',')))[0] if (str(x) != 'nan' and len(list(set(x.split(',')))) == 1) else x
twitter_archive_cp['expanded_urls'] = twitter_archive_cp['expanded_urls'].apply(getOnelink)

# tweet_id having more than 1 unique url: keep the first twitter url
def getTwitter(a): 
    return list(filter(lambda x: regex.findall(r'.+twitter.com.+', x), a))[0]

getTwitterlink = lambda x: getTwitter(x.split(',')) if (str(x) != 'nan' and len(list(set(x.split(',')))) > 1) else x
twitter_archive_cp['expanded_urls'] = twitter_archive_cp['expanded_urls'].apply(getTwitterlink)


number of tweet_id having more than 1 url 82.0
number of tweet_id having more than 2 urls 0.0


##### Test

In [520]:
# randomly check the column expanded_urls
twitter_archive_cp['expanded_urls'].sample(10)

308     https://twitter.com/dog_rates/status/835574547...
1739    https://twitter.com/dog_rates/status/679511351...
48      https://twitter.com/dog_rates/status/882992080...
643     https://twitter.com/dog_rates/status/793195938...
1884    https://twitter.com/dog_rates/status/674800520...
1425    https://twitter.com/dog_rates/status/697943111...
669     https://twitter.com/dog_rates/status/762699858...
34      https://twitter.com/dog_rates/status/885528943...
1154    https://twitter.com/foxdeportes/status/7251360...
642     https://twitter.com/dog_rates/status/793210959...
Name: expanded_urls, dtype: object

In [521]:
# count rows having more than one link
(twitter_archive_cp['expanded_urls'].apply(lambda x: len(x.split(',')) if str(x) != 'nan' else x) > 1).sum()

0
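The two cleaning passes (drop duplicates, then prefer the Twitter link) can also be expressed as one function. This is a sketch rather than the notebook's method: `pick_twitter_url` is a hypothetical name, and unlike the cell above it falls back to the first link when no Twitter link is present instead of raising an error.

```python
def pick_twitter_url(expanded_urls):
    """Reduce a comma-separated URL string to a single link."""
    if not isinstance(expanded_urls, str):
        return expanded_urls  # leave missing values (NaN) untouched
    unique = list(dict.fromkeys(expanded_urls.split(',')))  # dedupe, keep order
    twitter = [u for u in unique if 'twitter.com' in u]
    return twitter[0] if twitter else unique[0]

# value of row 98 shown during assessment
mixed = ('https://www.gofundme.com/help-my-baby-sierra-get-better,'
         'https://twitter.com/dog_rates/status/873213775632977920/photo/1,'
         'https://twitter.com/dog_rates/status/873213775632977920/photo/1')

print(pick_twitter_url(mixed))
print(pick_twitter_url('https://vine.co/v/iKIwAzEatd6'))
```

It could be applied with `twitter_archive_cp['expanded_urls'].apply(pick_twitter_url)`.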

#### Issue #4: 

##### Define: Column name contains wrong names
- Some values are ordinary lowercase words rather than names, such as a, the ...

##### Code

In [522]:
# treat all-lowercase strings as non-names and replace them with 'None'
twitter_archive_cp.name = twitter_archive_cp.name.apply(lambda x: 'None' if x.lower() == x else x)


##### Test

In [523]:
twitter_archive_cp.name.sample(10)

685        Leo
1867      None
346       None
939       None
782     Finley
394       Toby
2035     Oscar
971      Lilah
107      Rover
746       None
Name: name, dtype: object

In [524]:
twitter_archive_cp.name[twitter_archive_cp.name.str.lower() == twitter_archive_cp.name]

Series([], Name: name, dtype: object)
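A vectorized variant of the same rule is sketched below on a made-up sample. It uses `Series.str.islower()`, which is close to, but not identical to, the `x.lower() == x` test: a value without any cased characters (for example digits only) counts as lowercase under the comparison but not under `islower()`.

```python
import pandas as pd

# made-up sample mixing real names with lowercase words
names = pd.Series(['a', 'the', 'Moose', 'None', 'quite'])

# keep a value only where it is NOT all lowercase; otherwise write 'None'
cleaned = names.where(~names.str.islower(), 'None')
print(cleaned.tolist())
```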

#### Issue #5, #6, #7:

##### Define: 
- Tweet_ID = 740373189193256964: the dog's rating should be 14/10 
- Tweet_ID = 722974582966214656: the dog's rating should be 13/10
- Tweet_ID = 666287406224695296: the dog's rating should be 9/10

##### Code

In [525]:
# helper to get the row index of a tweet_id, used to change its numerator and denominator
def getIndex(tweet_id):
    return twitter_archive_cp[twitter_archive_cp.tweet_id == tweet_id].index.values[0]    

In [526]:
list_id = [740373189193256964, 722974582966214656, 666287406224695296]
list_no =[14,13,9]
list_de = [10]*3

In [527]:
# change the cell value for numerator and denominator values
for i,j,k in zip(list_id, list_no, list_de):
    twitter_archive_cp.at[getIndex(i), 'rating_numerator'] = j
    twitter_archive_cp.at[getIndex(i), 'rating_denominator'] = k


##### Test

In [528]:
twitter_archive_cp[twitter_archive_cp.tweet_id.isin(list_id)]


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1068,740373189193256964,,,2016-06-08 02:41:38+00:00,"<a href=""http://twitter.com/download/iphone"" r...","After so many requests, this is Bretagne. She ...",,,,https://twitter.com/dog_rates/status/740373189...,14,10,,,,,
1165,722974582966214656,,,2016-04-21 02:25:47+00:00,"<a href=""http://twitter.com/download/iphone"" r...",Happy 4/20 from the squad! 13/10 for all https...,,,,https://twitter.com/dog_rates/status/722974582...,13,10,,,,,
2335,666287406224695296,,,2015-11-16 16:11:11+00:00,"<a href=""http://twitter.com/download/iphone"" r...",This is an Albanian 3 1/2 legged Episcopalian...,,,,https://twitter.com/dog_rates/status/666287406...,9,10,,,,,
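Ratings like these were found by visual assessment. A programmatic screen could list every `number/number` pattern in the text and flag tweets with more than one candidate or an unusual denominator. The sketch below is an assumption about how such a check might look (`find_ratings` is a hypothetical helper), and the texts are the visible, truncated parts of the rows above.

```python
import re

RATING = re.compile(r'(\d+)/(\d+)')

def find_ratings(text):
    """Return every numerator/denominator pair that appears in the text."""
    return [(int(n), int(d)) for n, d in RATING.findall(text)]

squad = 'Happy 4/20 from the squad! 13/10 for all'
albanian = 'This is an Albanian 3 1/2 legged Episcopalian'

print(find_ratings(squad))
print(find_ratings(albanian))
```

In the first text the date `4/20` precedes the real rating, which illustrates how the first match in a tweet can be picked up instead of the intended rating.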


#### Issue #8:

##### Define: Tweet_ID = 775096608509886464 does not exist; it is a duplicate of Tweet_ID = 740373189193256964
- Remove Tweet_ID = 775096608509886464 from the dataset.

##### Code

In [529]:
# remove the row with this tweet id
twitter_archive_cp = twitter_archive_cp.loc[twitter_archive_cp.tweet_id != 775096608509886464]

##### Test

In [530]:
twitter_archive_cp.loc[twitter_archive_cp.tweet_id == 775096608509886464]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


#### Issue #9: 

##### Define: Remove data in twitter-archive-enhanced.csv that cannot be matched with image-predictions.tsv: keep only tweets not beyond August 1st, 2017

##### Code

In [531]:
twitter_archive_cp = twitter_archive_cp.loc[twitter_archive_cp.timestamp <= '2017-08-01']

##### Test

In [532]:
twitter_archive_cp.loc[twitter_archive_cp.timestamp > '2017-08-01']

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


### Tidiness issue

#### Issue #1: 

##### Define: Unpivot doggo, floofer, pupper, puppo into 1 column

##### Code

In [533]:
twitter_archive_cp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2352 entries, 2 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2352 non-null   int64              
 1   in_reply_to_status_id       77 non-null     float64            
 2   in_reply_to_user_id         77 non-null     float64            
 3   timestamp                   2352 non-null   datetime64[ns, UTC]
 4   source                      2352 non-null   object             
 5   text                        2352 non-null   object             
 6   retweeted_status_id         180 non-null    float64            
 7   retweeted_status_user_id    180 non-null    float64            
 8   retweeted_status_timestamp  180 non-null    object             
 9   expanded_urls               2294 non-null   object             
 10  rating_numerator            2352 non-null   int64           

In [534]:
dog_type = ['doggo', 'floofer', 'pupper', 'puppo']

In [565]:
# unpivot columns in dog_type
df1 = pd.melt(twitter_archive_cp, id_vars=twitter_archive_cp.columns.difference(dog_type), var_name='dog_type_name_column', value_name='dog_type')

In [571]:
# split dataset into None dog_type and dog_type in ['doggo', 'floofer', 'pupper', 'puppo'] and remove duplicates
df2 = df1.drop('dog_type_name_column', axis=1).loc[df1.dog_type == 'None'].drop_duplicates() # all tweet_id
df3 = df1.loc[df1.dog_type == df1.dog_type_name_column].drop('dog_type_name_column', axis=1).drop_duplicates() # tweet_id having dogtype

In [595]:
# df4 will exclude tweet_id having dogtype from df2 (real None dog type)
df4 = df2.loc[~df2.tweet_id.isin(df3.tweet_id)]
twitter_archive_cp = pd.concat([df3, df4], axis=0)

In [603]:
# reset index for twitter_archive_cp (reset_index returns a new frame, so assign it back)
twitter_archive_cp = twitter_archive_cp.reset_index(drop=True)
twitter_archive_cp

Unnamed: 0,expanded_urls,in_reply_to_status_id,in_reply_to_user_id,name,rating_denominator,rating_numerator,retweeted_status_id,retweeted_status_timestamp,retweeted_status_user_id,source,text,timestamp,tweet_id,dog_type
0,https://twitter.com/dog_rates/status/890240255...,,,Cassie,10,14,,,,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,2017-07-26 15:59:51+00:00,890240255349198849,doggo
1,https://twitter.com/dog_rates/status/884162670...,,,Yogi,10,12,,,,"<a href=""http://twitter.com/download/iphone"" r...",Meet Yogi. He doesn't have any important dog m...,2017-07-09 21:29:42+00:00,884162670584377345,doggo
2,https://twitter.com/dog_rates/status/872967104...,,,,10,12,,,,"<a href=""http://twitter.com/download/iphone"" r...",Here's a very large dog. He has a date later. ...,2017-06-09 00:02:31+00:00,872967104147763200,doggo
3,https://twitter.com/dog_rates/status/871515927...,,,Napolean,10,12,,,,"<a href=""http://twitter.com/download/iphone"" r...",This is Napolean. He's a Raggedy East Nicaragu...,2017-06-04 23:56:03+00:00,871515927908634625,doggo
4,https://twitter.com/animalcog/status/871075758...,,,,10,14,,,,"<a href=""http://twitter.com/download/iphone"" r...",Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,2017-06-03 20:33:19+00:00,871102520638267392,doggo
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2361,https://twitter.com/dog_rates/status/666049248...,,,,10,5,,,,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,2015-11-16 00:24:50+00:00,666049248165822465,
2362,https://twitter.com/dog_rates/status/666044226...,,,,10,6,,,,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,2015-11-16 00:04:52+00:00,666044226329800704,
2363,https://twitter.com/dog_rates/status/666033412...,,,,10,9,,,,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,2015-11-15 23:21:54+00:00,666033412701032449,
2364,https://twitter.com/dog_rates/status/666029285...,,,,10,7,,,,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,2015-11-15 23:05:30+00:00,666029285002620928,


##### Test

In [604]:
twitter_archive_cp['dog_type'].unique()

array(['doggo', 'floofer', 'pupper', 'puppo', 'None'], dtype=object)

In [605]:
twitter_archive_cp.shape

(2366, 14)
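The shape above has more rows than the 2352 rows present before unpivoting, which suggests that some tweets carry more than one stage and therefore appear once per stage. If one row per tweet is preferred, an alternative is to join the stages into a single value. The sketch below is one possible approach on a made-up frame; `collapse_stages` is a hypothetical helper, not the method used in this notebook.

```python
import pandas as pd

stages = ['doggo', 'floofer', 'pupper', 'puppo']

# made-up sample: tweet 3 carries two stages
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

def collapse_stages(df, columns):
    """Replace the stage columns with one comma-joined dog_type column."""
    def join_row(row):
        found = [value for value in row if value != 'None']
        return ','.join(found) if found else 'None'
    out = df.drop(columns=columns).copy()
    out['dog_type'] = df[columns].apply(join_row, axis=1)
    return out

tidy = collapse_stages(sample, stages)
print(tidy['dog_type'].tolist())
```

This keeps the row count equal to the number of tweets, at the cost of multi-valued entries such as `doggo,pupper`.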

#### Issue #2:

##### Define: Merge the image prediction columns from image-predictions.tsv (to identify which tweet_ids show dogs) and the retweet count and favorite count columns from tweet_api into the archive for analysis

In [606]:
image_prediction_cp.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [607]:
tweet_api_cp.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,6953,33614
1,892177421306343426,5256,29150
2,891815181378084864,3462,21935
3,891689557279858688,7163,36677
4,891327558926688256,7703,35066


In [608]:
# merge prediction result from image_prediction_cp into twitter_archive_cp
twitter_archive_cp = twitter_archive_cp.merge(image_prediction_cp[['tweet_id', 'p1_dog', 'p2_dog', 'p3_dog']], how='left', on='tweet_id')


In [609]:
# merge retweet count and favorite count into twitter_archive_cp
twitter_archive_cp = twitter_archive_cp.merge(tweet_api_cp, how='left', on='tweet_id')

##### Test

In [610]:
twitter_archive_cp.head()

Unnamed: 0,expanded_urls,in_reply_to_status_id,in_reply_to_user_id,name,rating_denominator,rating_numerator,retweeted_status_id,retweeted_status_timestamp,retweeted_status_user_id,source,text,timestamp,tweet_id,dog_type,p1_dog,p2_dog,p3_dog,retweet_count,favorite_count
0,https://twitter.com/dog_rates/status/890240255...,,,Cassie,10,14,,,,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,2017-07-26 15:59:51+00:00,890240255349198849,doggo,True,True,True,6051.0,27782.0
1,https://twitter.com/dog_rates/status/884162670...,,,Yogi,10,12,,,,"<a href=""http://twitter.com/download/iphone"" r...",Meet Yogi. He doesn't have any important dog m...,2017-07-09 21:29:42+00:00,884162670584377345,doggo,True,True,True,2489.0,17836.0
2,https://twitter.com/dog_rates/status/872967104...,,,,10,12,,,,"<a href=""http://twitter.com/download/iphone"" r...",Here's a very large dog. He has a date later. ...,2017-06-09 00:02:31+00:00,872967104147763200,doggo,True,True,True,4539.0,23919.0
3,https://twitter.com/dog_rates/status/871515927...,,,Napolean,10,12,,,,"<a href=""http://twitter.com/download/iphone"" r...",This is Napolean. He's a Raggedy East Nicaragu...,2017-06-04 23:56:03+00:00,871515927908634625,doggo,True,True,False,2922.0,17799.0
4,https://twitter.com/animalcog/status/871075758...,,,,10,14,,,,"<a href=""http://twitter.com/download/iphone"" r...",Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,2017-06-03 20:33:19+00:00,871102520638267392,doggo,,,,4641.0,18447.0
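The last row above has empty prediction columns, which is what a left merge produces when a `tweet_id` has no match on the right. A sketch of how such rows could be listed is shown below on made-up frames. It relies on the `validate` and `indicator` parameters of `DataFrame.merge`; the frame and variable names are illustrative only.

```python
import pandas as pd

# made-up frames mimicking the archive and the image predictions
archive = pd.DataFrame({'tweet_id': [1, 2, 3],
                        'dog_type': ['doggo', 'None', 'pupper']})
predictions = pd.DataFrame({'tweet_id': [1, 2],
                            'p1_dog': [True, False]})

merged = archive.merge(
    predictions,
    how='left',
    on='tweet_id',
    validate='many_to_one',  # raises if tweet_id repeats in predictions
    indicator=True,          # adds a _merge column describing each row's origin
)

# tweet ids present in the archive but absent from the predictions
unmatched = merged.loc[merged['_merge'] == 'left_only', 'tweet_id'].tolist()
print(unmatched)
```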


## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [614]:
# to_csv overwrites an existing file, so no explicit removal is needed
twitter_archive_cp.to_csv('twitter_archive_master.csv', index=False)

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

In [615]:
# import master file (saved to the current folder in the previous step)
master = pd.read_csv('twitter_archive_master.csv')

In [616]:
master.head()

Unnamed: 0,expanded_urls,in_reply_to_status_id,in_reply_to_user_id,name,rating_denominator,rating_numerator,retweeted_status_id,retweeted_status_timestamp,retweeted_status_user_id,source,text,timestamp,tweet_id,dog_type,p1_dog,p2_dog,p3_dog,retweet_count,favorite_count
0,https://twitter.com/dog_rates/status/890240255...,,,Cassie,10,14,,,,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,2017-07-26 15:59:51+00:00,890240255349198849,doggo,True,True,True,6051.0,27782.0
1,https://twitter.com/dog_rates/status/884162670...,,,Yogi,10,12,,,,"<a href=""http://twitter.com/download/iphone"" r...",Meet Yogi. He doesn't have any important dog m...,2017-07-09 21:29:42+00:00,884162670584377345,doggo,True,True,True,2489.0,17836.0
2,https://twitter.com/dog_rates/status/872967104...,,,,10,12,,,,"<a href=""http://twitter.com/download/iphone"" r...",Here's a very large dog. He has a date later. ...,2017-06-09 00:02:31+00:00,872967104147763200,doggo,True,True,True,4539.0,23919.0
3,https://twitter.com/dog_rates/status/871515927...,,,Napolean,10,12,,,,"<a href=""http://twitter.com/download/iphone"" r...",This is Napolean. He's a Raggedy East Nicaragu...,2017-06-04 23:56:03+00:00,871515927908634625,doggo,True,True,False,2922.0,17799.0
4,https://twitter.com/animalcog/status/871075758...,,,,10,14,,,,"<a href=""http://twitter.com/download/iphone"" r...",Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,2017-06-03 20:33:19+00:00,871102520638267392,doggo,,,,4641.0,18447.0


### Insights:
1.

2.

3.

### Visualization