# Project: Wrangle and Analyze Data

In [1]:
#Import the libraries used in this analysis
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import tweepy
import json



## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [2]:
tweets = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
#Check to ensure data was loaded correctly
tweets.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892421000000000000,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177000000000000,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,


In [4]:
tweets.dtypes

tweet_id                        int64
in_reply_to_status_id         float64
in_reply_to_user_id           float64
timestamp                      object
source                         object
text                           object
retweeted_status_id           float64
retweeted_status_user_id      float64
retweeted_status_timestamp     object
expanded_urls                  object
rating_numerator                int64
rating_denominator              int64
name                           object
doggo                          object
floofer                        object
pupper                         object
puppo                          object
dtype: object

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [5]:
#Download the image_predictions.tsv file from the stated weblink and save to this computer
link = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
tsv_file = requests.get(link)
tsv_file.raise_for_status()  # stop here if the download failed

with open('image_predictions.tsv', 'wb') as f:
    f.write(tsv_file.content)


In [6]:
#Load the dataset into a pandas DataFrame
#Check that the dataset was downloaded properly
image_predictions = pd.read_csv('image_predictions.tsv', sep = "\t")
image_predictions.head(2)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [None]:
# Twitter API access credentials
conKey = ""
conSecret = ""
accessToken = ""
accessTokenSecret = ""

In [None]:
# authorization of consumer key and consumer secret
auth = tweepy.OAuthHandler(conKey, conSecret)

# set access to user's access key and access secret
auth.set_access_token(accessToken, accessTokenSecret)

# calling the api
api = tweepy.API(auth, wait_on_rate_limit=True)


In [None]:
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor

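from timeit import default_timer as timer
# Tweet IDs to query, taken from the Twitter archive loaded above
tweet_ids = tweets['tweet_id'].values
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()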
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json_Main.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepyException as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)

In [None]:
dog_tweet = []
with open('tweet_json.txt', 'r') as file:
    for line in file:
        getTweets = json.loads(line)

        dog_tweet.append({'tweet_id': getTweets['id'],
                      'author_id': getTweets['user']['id'],
                      'statuses_count': getTweets['user']['statuses_count'],
                      'favorite_count': getTweets['favorite_count'],
                      'retweet_count': getTweets['retweet_count'],
                      'tweet_date_posted': getTweets['created_at']})
df_Tweet = pd.DataFrame.from_dict(dog_tweet)
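
The loop above reads one JSON object per line. A minimal sketch of the same pattern on two made-up records (the values are illustrative, not real API output), with blank lines skipped so that `json.loads` never receives an empty string:

```python
import io
import json

import pandas as pd

# Two made-up tweet records, one JSON object per line (illustrative values only)
sample = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2, "user": {"id": 99}}\n'
    '\n'
    '{"id": 2, "favorite_count": 5, "retweet_count": 1, "user": {"id": 99}}\n'
)

rows = []
for line in sample:
    line = line.strip()
    if not line:  # skip blank lines instead of failing in json.loads
        continue
    record = json.loads(line)
    rows.append({
        'tweet_id': record['id'],
        'author_id': record['user']['id'],
        'favorite_count': record['favorite_count'],
        'retweet_count': record['retweet_count'],
    })

sample_df = pd.DataFrame(rows)
```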

In [None]:
df_Tweet.head(2)

The following dataframes were created to hold data from the three different sources:
tweets  # DataFrame containing 'twitter-archive-enhanced.csv' data
image_predictions   # DataFrame containing 'image_predictions.tsv' data
df_Tweet  # DataFrame with additional information such as favorite_count


In [9]:
tweets.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892421000000000000,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177000000000000,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815000000000000,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


In [None]:
image_predictions.head(3)

In [None]:
df_Tweet.head(3)

In [None]:
# Make copies of original pieces of data. This ensures that the original dataset is not altered.
df_TweetDetails = df_Tweet.copy() #Make a copy of the Dataframe for more info such as favourite_count
df_image_predictions = image_predictions.copy()   # Make a copy of the DataFrame containing 'image_predictions.tsv' data
df_twitter_archive_enhanced= tweets.copy()  # Make a copy of the Dataframe containing 'twitter-archive-enhanced.csv'
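
Working on copies keeps the gathered data intact. A small sketch on a made-up dataframe (the names `original` and `working` are illustrative) shows that an edit to the copy does not reach the original:

```python
import pandas as pd

original = pd.DataFrame({'name': ['Rex', 'a']})

# .copy() returns an independent dataframe, so edits to it do not reach the original
working = original.copy()
working.loc[working['name'] == 'a', 'name'] = 'None'
```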

In [None]:
df_TweetDetails.head(2)

In [None]:
df_image_predictions.head(2)

In [None]:
df_twitter_archive_enhanced.head(2)

## Assessing Data

### Quality issues

1. The ‘twitter_archive_enhanced’ dataset contains retweets, which are not needed for the analysis; 181 retweets were identified.

2. Some attributes are mostly NULL. For example, the ‘retweeted_status_id’ attribute in the ‘twitter_archive_enhanced’ dataset has 2175 NULL entries and only 181 populated entries.

3. ‘tweet_id’ appears in all three datasets, so it can serve as the primary key (the central point of reference). A key should be unique, yet there are 7 duplicated ‘tweet_id’ values in the ‘twitter_archive_enhanced’ dataset.

4. Visual assessment shows several records without meaningful dog names. For example, ‘a’, ‘none’, and ‘an’ are not meaningful names.

5. Visual assessment shows several records with no dog stage indicated; that is, all four stage columns hold ‘None’. These records need attention before any meaningful analysis.

6. Comparing the ‘twitter_archive_enhanced’ and ‘image_predictions’ datasets shows that ‘image_predictions’ has no NULL entries, unlike ‘twitter_archive_enhanced’. This will pose a problem when merging the two datasets.

7. The date fields (‘timestamp’, ‘retweeted_status_timestamp’) are stored as strings rather than in ‘datetime’ format.

8. The ‘tweet_id’ datatype is int64, so it is treated as a number. Since it is an identifier and not a quantity, it is better stored as a string.

9. The ‘text’ attribute in ‘twitter_archive_enhanced’ contains both the tweet and the URL of the dog picture. The data would be of better quality with the tweet and the URL in separate cells.


### Tidiness issues

The structure of a dataset determines how tidy it is.
Hence, we review the structure of the three (3) datasets gathered using different methods.

The following observations were made:
1. 'doggo', 'floofer', 'pupper', and 'puppo' are all values of one variable, the dog stage. Hence, it is a structural flaw to have them as separate attributes/columns.
2. The datatypes in the TweetDetails dataframe are not correct. They differ from those of similar fields in the other dataframes; similar fields should all share the same datatype. For example, tweet_id is an integer variable, whereas it should be an object/string, consistent with the other dataframes.

3. There are three separate files, one of which contains artificial intelligence predictions. All relevant records are better kept together in one file where possible, preferably one that contains the prediction labels. Spreading related records across several files may be considered untidy.

In [None]:
df_twitter_archive_enhanced.shape

In [None]:
df_image_predictions.shape

In [None]:
df_TweetDetails.shape

In [None]:
# Rows of image_predictions whose tweet_id also appears in TweetDetails
df1 = df_image_predictions[df_image_predictions['tweet_id'].isin(df_TweetDetails['tweet_id'])]

In [None]:
df1.shape

In [None]:
# Rows of image_predictions whose tweet_id also appears in the Twitter archive
df2 = df_image_predictions[df_image_predictions['tweet_id'].isin(df_twitter_archive_enhanced['tweet_id'])]

In [None]:
df2.shape

In [None]:
# Rows of the Twitter archive whose tweet_id also appears in image_predictions
df3 = df_twitter_archive_enhanced[df_twitter_archive_enhanced['tweet_id'].isin(df_image_predictions['tweet_id'])]

In [None]:
df3.shape

In [None]:
# Rows of the Twitter archive whose tweet_id also appears in TweetDetails
df4 = df_twitter_archive_enhanced[df_twitter_archive_enhanced['tweet_id'].isin(df_TweetDetails['tweet_id'])]

In [None]:
df4.shape

In [None]:
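# Rows of the Twitter archive whose tweet_id appears in both of the other dataframes
df5 = df_twitter_archive_enhanced[df_twitter_archive_enhanced['tweet_id'].isin(df_TweetDetails['tweet_id']) & df_twitter_archive_enhanced['tweet_id'].isin(df_image_predictions['tweet_id'])]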

In [None]:
df5.shape
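
One way to count the tweet IDs that two dataframes have in common is `Series.isin`, which tests membership by value rather than by row position. A minimal sketch on made-up IDs (`ids_a` and `ids_b` are illustrative stand-ins for two `tweet_id` columns):

```python
import pandas as pd

# Small made-up ID columns standing in for two of the dataframes
ids_a = pd.Series([101, 102, 103, 104], name='tweet_id')
ids_b = pd.Series([103, 104, 105], name='tweet_id')

# Boolean mask: True where an ID from ids_a also appears in ids_b
shared_mask = ids_a.isin(ids_b)
shared_ids = ids_a[shared_mask].tolist()

# The same overlap computed with plain Python sets
shared_set = set(ids_a) & set(ids_b)
```

The mask has one entry per row of `ids_a`, so it can be used directly to filter the dataframe that `ids_a` came from.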

## Cleaning Data

In [None]:
# Make copies of original pieces of data. This ensures that the original dataset is not altered.
df_Clean_TweetDetails = df_Tweet.copy() #Make a copy of the Dataframe for more info such as favourite_count
df_Clean_image_predictions = image_predictions.copy()   # Make a copy of the DataFrame containing 'image_predictions.tsv' data
df_Clean_twitter_archive_enhanced = tweets.copy()  # Make a copy of the Dataframe containing 'twitter-archive-enhanced.csv'

### Issue #1:


#### Define:
The ‘twitter_archive_enhanced’ dataset contains retweets, which are not needed for the analysis; 181 retweets were identified.

#### Code

In [None]:
# Check the number of non-null values in the dataframe
df_Clean_twitter_archive_enhanced['retweeted_status_id'].notnull().sum()

In [None]:
df_Clean_twitter_archive_enhanced.shape

In [None]:
df_Clean_twitter_archive_enhanced.isnull().sum()

In [None]:
#Remove all retweets, keeping only the rows where retweeted_status_id is null (NaN)
df_Clean_twitter_archive_enhanced = df_Clean_twitter_archive_enhanced[df_Clean_twitter_archive_enhanced['retweeted_status_id'].isnull()].copy()

#### Test

In [None]:
df_Clean_twitter_archive_enhanced.shape

In [None]:
df_Clean_twitter_archive_enhanced.info()
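
Missing values in a float column are NaN, and NaN never compares equal to anything, so a null test is normally written with `isna()` (or its alias `isnull()`). A minimal sketch on a made-up frame with one retweet:

```python
import numpy as np
import pandas as pd

# Made-up frame: two original tweets and one retweet
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [np.nan, 555.0, np.nan],
})

# NaN is not equal to itself, so an equality test never finds missing values
equal_to_nan = (sample['retweeted_status_id'] == np.nan).sum()

# isna() is the null test; keep only the rows that are not retweets
originals = sample[sample['retweeted_status_id'].isna()].copy()
```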

### Issue #2:

#### Define
Some attributes are mostly NULL. For example, the 'retweeted_status_id' attribute in the ‘twitter_archive_enhanced’ dataset has 2175 NULL entries and only 181 populated entries. To solve this problem, we delete the attributes that are now entirely NULL:
retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp

#### Code

In [None]:
df_Clean_twitter_archive_enhanced.info();

In [None]:
df_Clean_twitter_archive_enhanced.drop(columns=['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], inplace = True)

#### Test

In [None]:
#confirm that the unwanted columns have been removed
df_Clean_twitter_archive_enhanced.info()

### Issue #3:

#### Define
‘tweet_id’ appears in all three datasets, so it can serve as the primary key (the central point of reference). A key should be unique, yet there are 7 duplicated ‘tweet_id’ values in the ‘twitter_archive_enhanced’ dataset. A closer look at the duplicated rows shows that they are either retweets or replies; these will be removed.

#### Code

In [None]:
#Identify if there are any duplicates
df_Clean_twitter_archive_enhanced['tweet_id'].shape[0] - df_Clean_twitter_archive_enhanced['tweet_id'].unique().shape[0]


In [None]:
#Another way to confirm if there are any duplicates
df_Clean_twitter_archive_enhanced['tweet_id'].duplicated().value_counts()

In [None]:
# A closer look at the rows with a duplicated tweet_id is necessary for better understanding
dup = df_Clean_twitter_archive_enhanced[df_Clean_twitter_archive_enhanced['tweet_id'].duplicated(keep=False)]

In [None]:
dup

On closer examination, some of the duplicated 'tweet_id' values result from replies. Hence, we delete all such replies.

In [None]:
#Remove all replies, keeping only the rows where in_reply_to_status_id is null (NaN)
df_Clean_twitter_archive_enhanced = df_Clean_twitter_archive_enhanced[df_Clean_twitter_archive_enhanced['in_reply_to_status_id'].isnull()].copy()

In [None]:
df_Clean_twitter_archive_enhanced.shape

In [None]:
df_Clean_twitter_archive_enhanced.tail(3)

#### Test

In [None]:
#After all records containing reply tweets are removed, notice that the duplicates are almost totally gone.
dup = df_Clean_twitter_archive_enhanced[df_Clean_twitter_archive_enhanced['tweet_id'].duplicated(keep=False)]

In [None]:
dup


Further examination shows that one of the duplicates must have been an error since 'name' was 'None'. This record is to be deleted.

In [None]:
#Delete the row with index=1871 because it is clearly the duplicate, considering the timestamp of the post and the tweet text
df_Clean_twitter_archive_enhanced.drop(index=1871, inplace=True)

In [None]:
#We confirm that there are no more duplicates
df_Clean_twitter_archive_enhanced['tweet_id'].duplicated(keep=False).sum()
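
The `keep` argument of `duplicated` controls which copies are flagged. A short sketch on made-up rows contrasts inspecting every copy with dropping only the later ones:

```python
import pandas as pd

sample = pd.DataFrame({
    'tweet_id': ['10', '20', '20', '30'],
    'name': ['Rex', 'Tilly', 'None', 'Archie'],
})

# keep=False flags every member of a duplicated group, which is useful for inspection
all_copies = sample[sample['tweet_id'].duplicated(keep=False)]

# keep='first' (the default) treats only the later copies as duplicates when dropping
deduplicated = sample.drop_duplicates(subset='tweet_id', keep='first')
```

Which copy to keep is a judgment call; here the first one is kept purely for illustration.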

### Issue #4:

#### Define
Visual assessment shows several records without meaningful dog names. For example, ‘a’, ‘his’, and ‘an’ are not meaningful names. Programmatic assessment shows 955 unique names, including these meaningless ones.

#### Code

In [None]:
# programmatically assess the number of unique names
df_Clean_twitter_archive_enhanced.name.unique()

In [None]:
df_Clean_twitter_archive_enhanced.shape

#The following names were discovered in the dataset and considered inappropriate
#['a', 'an', 'the', 'by', 'his', 'one', 'infuriating','officially', 'all', 'space', 'O', 'such', 'incredibly', 'life', 'my', 'this', 'old', 'mad', 'unacceptable']

In [None]:
df_Clean_twitter_archive_enhanced.query("name == ['a', 'an', 'the', 'by', 'his', 'one', 'infuriating','officially', 'all', 'space', 'O', 'such', 'incredibly', 'life', 'my', 'this', 'old', 'mad', 'unacceptable']")  # 1138, 775,22, 852,1206

In [None]:
df_Clean_twitter_archive_enhanced['name'].value_counts()

In [None]:
#Replace all the non-name words identified above with 'None'
df_Clean_twitter_archive_enhanced['name'] = df_Clean_twitter_archive_enhanced['name'].replace(['a', 'an', 'the', 'by', 'his', 'one', 'infuriating','officially', 'all', 'space', 'O', 'such', 'incredibly', 'life', 'my', 'this', 'old', 'mad', 'unacceptable', 'very', 'any', 'quite', 'not', 'just', 'getting'], 'None')

In [None]:
#Review the effect of the changes
df_Clean_twitter_archive_enhanced['name'].value_counts()

#Alternatively, a regular expression can replace every name that starts with a lowercase letter.
#Names are proper nouns and should start with an uppercase letter.
#The pattern is anchored with ^ and $ so that the whole value is replaced, not fragments inside valid names.
df_Clean_twitter_archive_enhanced['name'] = df_Clean_twitter_archive_enhanced['name'].replace(regex=r'^[a-z].*$', value='None')
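
Names that begin with a lowercase letter can also be flagged with a boolean mask, which makes it easy to inspect them before replacing anything. A minimal sketch on made-up names:

```python
import pandas as pd

names = pd.Series(['Phineas', 'a', 'quite', 'Tilly', 'None', 'an'])

# str.match anchors at the start of the string, so this flags
# names whose first character is a lowercase letter
is_lowercase = names.str.match(r'[a-z]')

# Replace only the flagged entries; capitalised names are left untouched
cleaned = names.mask(is_lowercase, 'None')
```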

In [None]:
# Extract the records whose name is 'None' from the dataframe
# This dataframe holds the records whose names will be extracted programmatically from the tweet text
tx_none = df_Clean_twitter_archive_enhanced[df_Clean_twitter_archive_enhanced['name'] == 'None'].copy()

In [None]:
#Confirm that the correct data was copied
tx_none.head(2)

In [None]:
# Extract fields with available name values from the dataframe
tx_value = df_Clean_twitter_archive_enhanced[df_Clean_twitter_archive_enhanced['name'] != 'None'].copy()

In [None]:
#Confirm that the correct data was copied
tx_value.head(2)

In [None]:
# View the content of the dataframe that stores 'None' values in the 'name' field
tx_none['name'].value_counts()

In [None]:
# View the content of the dataframe that contains actual values in the 'name' field
tx_value['name'].value_counts()

In [None]:
# Function to extract the dog's name from the tweet text:
# the name is taken to be the word that follows the first 'named'
def get_dog_name(val):
    ret = 'None'
    tweet_text = val.split()
    # Stop one word early so that i + 1 is always a valid position
    for i, w in enumerate(tweet_text[:-1]):
        if w == 'named':
            ret = tweet_text[i + 1].strip('.')
            break
    return ret
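
The same idea can be sketched with a regular expression. The version below is an alternative, not the method used in this notebook: it assumes that a real name starts with a capital letter, and the names `NAMED_PATTERN` and `extract_named_dog` are illustrative.

```python
import re

# Matches 'named' followed by a capitalised word (assumed naming convention)
NAMED_PATTERN = re.compile(r'\bnamed\s+([A-Z][a-zA-Z]*)')

def extract_named_dog(text):
    """Return the capitalised word after 'named', or 'None' if there is none."""
    match = NAMED_PATTERN.search(text)
    return match.group(1) if match else 'None'
```

Because the pattern requires a capital letter, a phrase such as "named after his dad" yields 'None' rather than 'after'.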


In [None]:
# Extract any name of dog that appears in the tweet text and store in 'name' field
tx_none['name']= tx_none['text'].apply(lambda x: get_dog_name(x))

In [None]:
# Preview the unique names extracted
tx_none['name'].value_counts()

In [None]:
# Check the frequency of values in the name field before combining
# the dataframe where 'None' was replaced with names extracted from the tweet text
df_Clean_twitter_archive_enhanced.name.value_counts()

In [None]:
# Empty the original dataframe to prevent unforeseen issues when concatenating
df_Clean_twitter_archive_enhanced = pd.DataFrame()

In [None]:
# Concatenate both dataframes with/without 'None' entries and replace the original dataframe
df_Clean_twitter_archive_enhanced = pd.concat([tx_none,tx_value], ignore_index=False).sort_index()
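
Splitting and re-concatenating is one route; another is to update only the affected rows in place with a boolean mask and `.loc`. A minimal sketch on made-up rows, where `name_after_named` is a hypothetical helper standing in for the extraction function:

```python
import pandas as pd

def name_after_named(text):
    # Hypothetical helper: return the word after 'named', or 'None'
    words = text.split()
    for i, word in enumerate(words[:-1]):
        if word == 'named':
            return words[i + 1].strip('.')
    return 'None'

sample = pd.DataFrame({
    'text': ['This is Rex. 12/10', 'A fluffy boy named Bo. 11/10', 'Just a dog. 10/10'],
    'name': ['Rex', 'None', 'None'],
})

# Update only the rows whose name is missing; other rows and the row order stay as they are
missing = sample['name'] == 'None'
sample.loc[missing, 'name'] = sample.loc[missing, 'text'].apply(name_after_named)
```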

In [None]:
df_Clean_twitter_archive_enhanced.head()

#### Test

In [None]:
tx_none.shape

In [None]:
tx_value.shape

In [None]:
# Confirm that the total number of rows = rows of tx_value and tx_none
df_Clean_twitter_archive_enhanced.shape

In [None]:
# Check the frequency of values in the name field AFTER the merger
df_Clean_twitter_archive_enhanced.name.value_counts()

In [None]:
df_Clean_twitter_archive_enhanced.head(10)

### Issue #5:

#### Define
Visual assessment shows several records with no dog stage indicated; that is, all four stage columns hold 'None'. These records need attention before any meaningful analysis.

#### Code

In [None]:
df_Clean_twitter_archive_enhanced.tail(3)

In [None]:
noName_noStage = df_Clean_twitter_archive_enhanced.query('name =="None" & doggo =="None" & floofer=="None" & pupper =="None" & puppo=="None"')

In [None]:
noName_noStage

In [None]:
df_Clean_twitter_archive_enhanced.info();

In [None]:
ind = df_Clean_twitter_archive_enhanced.query('name =="None" & doggo =="None" & floofer=="None" & pupper =="None" & puppo=="None"').index

In [None]:
len(ind)

With 1245 rows having no dog stage reported, deleting all of them would discard too much of the data. However, 530 rows have neither a valid name (i.e. name=None) nor a dog stage. These rows are deleted.

In [None]:
df_Clean_twitter_archive_enhanced.drop(index=ind, inplace=True)

#### Test

In [None]:
df_Clean_twitter_archive_enhanced.info()

### Issue #6:

#### Define
Comparing the ‘twitter_archive_enhanced’ and ‘image_predictions’ datasets shows that ‘image_predictions’ has no NULL entries, unlike ‘twitter_archive_enhanced’. This will pose a problem when merging the two datasets.


#### Code


In [None]:
df_Clean_image_predictions.info()

In [None]:
df_Clean_twitter_archive_enhanced.info()

In [None]:
#Delete columns 'in_reply_to_status_id' and 'in_reply_to_user_id' since they don't have any value
df_Clean_twitter_archive_enhanced.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id'], inplace=True)

In [None]:
#Assessing the 'image_predictions' dataset shows that there are NO NULL entries
df_Clean_image_predictions.isnull().sum()

In [None]:
df_Clean_twitter_archive_enhanced.isnull().sum()

We now have a cleaner dataset with almost no NULL values. The few nulls remaining in 'expanded_urls' are negligible, as that field is an extra URL that will not significantly affect the analysis.

### Issue #7:

#### Define:
The date fields ('timestamp', 'retweeted_status_timestamp') are stored as strings rather than in 'datetime' format. Since 'retweeted_status_timestamp' was dropped under Issue #2, only 'timestamp' needs converting.

#### Code

In [None]:
df_Clean_twitter_archive_enhanced.dtypes

In [None]:
df_Clean_twitter_archive_enhanced['timestamp'] = pd.to_datetime(df_Clean_twitter_archive_enhanced['timestamp'])

#### Test

In [None]:
df_Clean_twitter_archive_enhanced.dtypes
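
A short sketch of the conversion on two timestamps in the archive's format. Passing `utc=True` is an assumption made here so that the result is always timezone-aware in UTC; the cell above relies on pandas inferring the offset from the text.

```python
import pandas as pd

stamps = pd.Series(['2017-08-01 16:23:56 +0000', '2017-07-31 00:18:03 +0000'])

# utc=True makes the result timezone-aware in UTC regardless of the offset in the text
parsed = pd.to_datetime(stamps, utc=True)

# Datetime columns expose date parts through the .dt accessor
months = parsed.dt.month.tolist()
```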

### Issue #8:

#### Define:

The 'tweet_id' datatype is int64, so it is treated as a number. Since it is an identifier and not a quantity, it is better stored as a string.

#### Code

In [None]:
#Convert the datatype from numeric to string
df_Clean_twitter_archive_enhanced['tweet_id'] = df_Clean_twitter_archive_enhanced['tweet_id'].astype(str)

In [None]:
#Convert the datatype from numeric to string
df_Clean_image_predictions['tweet_id'] = df_Clean_image_predictions['tweet_id'].astype(str)

#### Test

In [None]:
#Confirm the conversion of datatype from numeric to string
df_Clean_twitter_archive_enhanced.dtypes

In [None]:
#Confirm the conversion of datatype from numeric to string
df_Clean_image_predictions.dtypes
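
The ID column can also be read as text at load time, which avoids any numeric handling of the IDs. A minimal sketch, assuming a CSV with a `tweet_id` column; the 18-digit ID below is made up:

```python
import io

import pandas as pd

# One made-up row in a layout similar to the archive file
csv_text = "tweet_id,rating_numerator\n123456789012345679,13\n"

# dtype=str for the ID column keeps every digit exactly as written in the file
ids_as_text = pd.read_csv(io.StringIO(csv_text), dtype={'tweet_id': str})

# An 18-digit ID cannot pass through a 64-bit float without rounding
original = 123456789012345679
after_float = int(float(original))
```

The float round trip is the reason identifiers of this length are safer as strings: any step that silently converts them to floating point changes their last digits.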

### Issue #9:

#### Define:
The 'text' attribute in 'twitter_archive_enhanced' contains both the tweet and the URL of the dog picture. The data would be of better quality with the tweet and the URL in separate cells.

#### Code

In [None]:
df_Clean_twitter_archive_enhanced[['text', 'URL']] = df_Clean_twitter_archive_enhanced['text'].str.split(' https://', n=1, expand=True)

#### Test

In [None]:
df_Clean_twitter_archive_enhanced.tail(3)
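
An alternative sketch that keeps the `https://` prefix on the extracted link uses `str.extract` for the URL and `str.replace` for the text. The sample tweets and the link are made up, and the pattern assumes the link sits at the end of the tweet:

```python
import pandas as pd

texts = pd.Series([
    'This is Rex. 12/10 https://t.co/abc123',
    'No picture link in this one. 10/10',
])

# Capture a trailing link, if any; rows without one get a missing value
urls = texts.str.extract(r'(https?://\S+)\s*$', expand=False)

# Remove the trailing link (and the space before it) from the text
clean_texts = texts.str.replace(r'\s*https?://\S+\s*$', '', regex=True)
```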

### Issue #10:

#### Define:
'doggo', 'floofer', 'pupper', and 'puppo' are all values of one variable, the dog stage. Hence, it is a structural flaw to have them as stand-alone attributes/columns.

To solve this problem, we create a column named 'dog_stage' that holds the valid stage name found in any of the existing columns. The separate dog stage columns ('doggo', etc.) are then deleted, leaving only the 'dog_stage' attribute.

#### Code

In [None]:
# Concatenate the related attributes into a single attribute
df_Clean_twitter_archive_enhanced['dog_stage'] = df_Clean_twitter_archive_enhanced['doggo'].str.cat(df_Clean_twitter_archive_enhanced[['pupper', 'puppo', 'floofer']], sep=' ')

In [None]:
# Function to determine the appropriate dog stage from the concatenated attributes
def get_dog_stage(val):
    if 'pupper' in val:
        ret = 'pupper'
    elif 'puppo' in val:
        ret = 'puppo'
    elif 'floofer' in val:
        ret = 'floofer'
    elif 'doggo' in val:
        ret = 'doggo'
    else:
        ret = np.nan

    return ret


In [None]:
df_Clean_twitter_archive_enhanced['dog_stage'] = df_Clean_twitter_archive_enhanced['dog_stage'].apply(lambda x: get_dog_stage(x))

In [None]:
#The affected columns can now be deleted
df_Clean_twitter_archive_enhanced.drop(columns=['pupper', 'puppo', 'floofer', 'doggo'], inplace=True)
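
A tweet can mention more than one stage. The sketch below is an alternative to picking a single stage: it joins every stage that is present, so multi-stage tweets stay visible. The sample rows and the helper name `combine_stages` are illustrative.

```python
import pandas as pd

stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']

sample = pd.DataFrame({
    'doggo':   ['doggo', 'doggo',  'None'],
    'floofer': ['None',  'None',   'None'],
    'pupper':  ['None',  'pupper', 'None'],
    'puppo':   ['None',  'None',   'None'],
})

def combine_stages(row):
    # Join every stage that is present; return None when the tweet names no stage
    present = [value for value in row if value != 'None']
    return ','.join(present) if present else None

sample['dog_stage'] = sample[stage_columns].apply(combine_stages, axis=1)
```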

#### Test

In [None]:
df_Clean_twitter_archive_enhanced['dog_stage'].value_counts()

In [None]:
df_Clean_twitter_archive_enhanced.head(10)

### Issue #11:

#### Define:
The datatypes in the TweetDetails dataframe are not correct. They differ from those of similar fields in the other dataframes; similar fields should all share the same datatype. For example, tweet_id is an integer variable, whereas it should be an object/string, consistent with the other dataframes.

#### Code

In [None]:
# We examine the datatype for the TweetDetails dataframe
df_Clean_TweetDetails.dtypes

In [None]:
# Convert the datatype of tweet_id to string
df_Clean_TweetDetails.tweet_id = df_Clean_TweetDetails.tweet_id.astype(str)

In [None]:
# Convert the datatype of author_id to string
df_Clean_TweetDetails.author_id = df_Clean_TweetDetails.author_id.astype(str)

In [None]:
# Convert the datatype of tweet_date_posted to datetime format
df_Clean_TweetDetails['tweet_date_posted'] = pd.to_datetime(df_Clean_TweetDetails['tweet_date_posted'])

#### Test

In [None]:
# Confirm that the datatype changes for the TweetDetails dataframe have taken effect
df_Clean_TweetDetails.dtypes

Check the tweet_id for the other dataframes to be sure they are of same datatype and compatible.

In [None]:
df_Clean_twitter_archive_enhanced.dtypes

In [None]:
df_Clean_image_predictions.dtypes

### Issue #12:

#### Define:
The datasets gathered have different shapes: the "twitter_archive_enhanced" data is (2356, 17), the "image_predictions" data is (2075, 12), and the extra dataset obtained via the Twitter API is (2354, 6). These inconsistencies in structure make the data untidy.

#### Code

In [None]:
#Examine the dataframe for the main twitter data
df_Clean_twitter_archive_enhanced.info()

In [None]:
#Examine the dataframe for the image predictions
df_Clean_image_predictions.info()

In [None]:
# Assess the tweet_ids that are common to both dataframes
df_Clean_TweetDetails.tweet_id.isin(df_Clean_image_predictions.tweet_id).value_counts()

In [None]:
# Assess the tweet_ids that are common to both dataframes
df_Clean_TweetDetails.tweet_id.isin(df_Clean_twitter_archive_enhanced.tweet_id).value_counts()

In [None]:
# Assess the tweet_ids that are common to both dataframes
df_Clean_twitter_archive_enhanced.tweet_id.isin(df_Clean_image_predictions.tweet_id).value_counts()

In [None]:
# Join the dataframes: df_Clean_image_predictions, df_Clean_TweetDetails
df_join_tweet_predict = df_Clean_image_predictions.merge(df_Clean_TweetDetails, how='inner', left_on='tweet_id', right_on='tweet_id')
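
Before settling on an inner join, it can help to audit how many keys match. A minimal sketch on two made-up frames uses the `indicator` and `validate` options of `merge`:

```python
import pandas as pd

predictions = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'p1': ['pug', 'collie', 'redbone']})
details = pd.DataFrame({'tweet_id': ['2', '3', '4'], 'favorite_count': [10, 20, 30]})

# An outer join with indicator=True labels each row by where its key was found
audit = predictions.merge(details, on='tweet_id', how='outer', indicator=True)
match_counts = audit['_merge'].value_counts()

# The inner join keeps only keys present in both frames;
# validate raises an error if tweet_id is not unique on either side
joined = predictions.merge(details, on='tweet_id', how='inner', validate='one_to_one')
```

The `_merge` column takes the values `left_only`, `right_only`, and `both`, so its counts show at a glance how many rows an inner join would discard.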


#### Test


In [None]:
#Assess the structure of the combined dataframes
df_join_tweet_predict.info()

In [None]:
df_join_tweet_predict.head()

In [None]:
df_join_tweet_predict[['tweet_id','favorite_count', 'tweet_date_posted']].sort_values(ascending=False, by='favorite_count')

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [None]:
# Save the merged datasets containing tweet details and predictions
df_join_tweet_predict.to_csv("twitter_archive_master.csv", index=False)

In [None]:
# Save the cleaned version of twitter_archive_enhanced.
# It is saved separately from the others because its tweet_ids do not match theirs.
df_Clean_twitter_archive_enhanced.to_csv('Clean_twitter_archive_enhanced.csv')

## Analyzing and Visualizing Data
In this section, we analyze and visualize our wrangled data. We also report on some insights derived from the datasets.

### Insights:
1. First, all the tweets with image predictions, each with a unique tweet_id, came from the same author ('4196983835').

2. There were four (4) dog stages mentioned in the tweets, namely Puppo, Doggo, Pupper, and Floofer. The most frequently rated dog stage was the Pupper. It is noted, however, that most of the tweets did not specify the stage of the dog being rated.

3. The records show that the Floofer is the dog stage of least interest; it is rated least often compared to the other dog stages.

4. The most liked tweet (132810 likes), with tweet_id '822872901745569793', was posted on 2017-01-21 18:26:02+00:00.

In [None]:
# List each tweet's dog stage alongside its rating, highest rating first
df_Clean_twitter_archive_enhanced[['dog_stage', 'rating_numerator']].sort_values(by='rating_numerator', ascending=False, ignore_index=True)
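
"Most rated" can be read as "rated most often", which is a count per stage. A minimal sketch on made-up rows shows the count, the share of tweets with no stage, and the mean rating per stage for comparison:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'dog_stage': ['pupper', 'pupper', 'doggo', np.nan, 'puppo', 'pupper'],
    'rating_numerator': [10, 12, 13, 11, 14, 11],
})

# value_counts ignores missing stages by default and sorts by frequency
stage_counts = sample['dog_stage'].value_counts()

# Share of tweets with no stage recorded
share_missing = sample['dog_stage'].isna().mean()

# Average rating per stage, for comparison with the counts
mean_rating = sample.groupby('dog_stage')['rating_numerator'].mean()
```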

In [None]:
df_join_tweet_predict['author_id'].value_counts()

In [None]:
df_join_tweet_predict['tweet_id'].nunique()

### Visualization

In this section, we visualize the different dog stages rated in the tweets. The data used for this visualization is from the dataset that has undergone assessment and cleaning. Hence, it is suitable for analysis and visualization.

In [None]:
dog_stages = df_Clean_twitter_archive_enhanced.dog_stage.value_counts()

In [None]:
dog_stages

In [None]:
dog_stages.plot(kind='bar', title='Chart of Dog stages', xlabel = 'Dog Stages', ylabel= 'Frequency', figsize=(8,8));


In [None]:
plt.figure(figsize=(8,8))
plt.pie(dog_stages, labels=dog_stages.index)
plt.title("Chart of Dog Stages");
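
The object-oriented Matplotlib interface sets the title and axis labels on an Axes object. A minimal sketch with made-up counts (the real values would come from `dog_stage.value_counts()`):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts, for illustration only
stage_counts = pd.Series({'pupper': 5, 'doggo': 3, 'puppo': 2, 'floofer': 1})

fig, ax = plt.subplots(figsize=(8, 8))
ax.bar(stage_counts.index, stage_counts.values)
ax.set_title('Chart of Dog Stages')
ax.set_xlabel('Dog Stages')
ax.set_ylabel('Frequency')
```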