*Marcello Victorino* <br>
*04/30/2019*

# Introduction
This project is one of the requirements for graduating from Udacity's Data Analyst Nanodegree (*DAND*).

It provides the opportunity to put Data Wrangling into practice: gathering data from different sources, assessing it for quality and tidiness issues, and then performing the necessary cleaning tasks programmatically.

Finally, once the data is properly cleaned and stored, a brief analysis is conducted with visualizations, highlighting interesting insights.

The data for this project was provided in partnership with the **WeRateDogs** Twitter account and contains over 5,000 observations about dogs.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline

## Gather

In [2]:
# Gathering Twitter Enhanced Archive data
df_archive = pd.read_csv('twitter-archive-enhanced-2.csv')
df_archive.head(2)

|   | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | | | 2017-08-01 16:23:56 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Phineas. He's a mystical boy. Only eve... | | | | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | | | | |
| 1 | 892177421306343426 | | | 2017-08-01 00:17:27 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Tilly. She's just checking pup on you.... | | | | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | | | | |


In [54]:
# Gathering Image Predictions for Dog Breed - Available online
import requests
import os

# Avoid redownloading if file already saved locally
if not os.path.exists('dog_breed.txt'):
    url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

    # Accessing file online (stop here if the request failed)
    r = requests.get(url)
    r.raise_for_status()

    # Saving content locally
    with open('dog_breed.txt', 'wb') as fh:
        fh.write(r.content)

# Reading file as Dataframe
df_breed = pd.read_csv('dog_breed.txt', sep='\t')
df_breed.to_csv('dog_breed.csv', index=False)
df_breed.head(2)

|   | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.07201 | True |
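
The "download only if the file is missing" pattern used above can be wrapped in a small helper. The sketch below is one possible shape for it: `fetch_once` and `fake_download` are hypothetical names that do not appear in the project, and the stand-in downloader returns made-up bytes so the example runs without network access.

```python
import tempfile
from pathlib import Path


def fetch_once(path, fetch):
    """Return the bytes stored at `path`, calling `fetch()` only if the file is missing."""
    target = Path(path)
    if not target.exists():
        target.write_bytes(fetch())  # first run: download and cache
    return target.read_bytes()       # later runs: read the cached copy


# Demonstration with a stand-in downloader (hypothetical data, no network needed).
calls = []


def fake_download():
    calls.append(1)  # record every real "download"
    return b'tweet_id\tp1\n1\tcollie\n'


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'dog_breed.txt'
    first = fetch_once(cache_file, fake_download)   # triggers the download
    second = fetch_once(cache_file, fake_download)  # served from the cached file
```

In the notebook, the `fetch` argument could be something like `lambda: requests.get(url).content`, which keeps the caching logic separate from the HTTP call.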


In [4]:
# Gathering data from Twitter API
import tweepy
from twitter_secret_credentials import Twitter_API_Authenticate # personal script

# Authenticating connection to Twitter API
api = Twitter_API_Authenticate() # wait_on_rate_limit=True, wait_on_rate_limit_notify=True

In [5]:
# Retrieve data from Twitter ID (JSON format)
import json
from tqdm import tqdm_notebook as progressbar

# Avoid redownloading if file already saved locally
if 'tweet_json.txt' in os.listdir():
    print('JSON data has already been downloaded.')

else:
    # Retrieve data from Twitter API and save locally
    fails = []
    count = 0
    with open('tweet_json.txt', 'w') as file:
        for tweet_id in progressbar(df_archive.tweet_id[:]):
            count += 1
            try:
                tweet = api.get_status(tweet_id, tweet_mode='extended')
                json.dump(tweet._json, file)
                file.write('\n') # important to separate each tweet  

            except tweepy.TweepError:  # tweet deleted or inaccessible (exception class in tweepy 3.x)
                fails.append(tweet_id)

    fail_percentage = len(fails)/count
    print(f'Successfully read: {(1 - fail_percentage):.0%}') # original run: 19 tweets could not be read; ~25 minutes

JSON data has already been downloaded.
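
The newline written after each `json.dump` call is what makes the file readable one tweet per line later on. A minimal sketch of that round trip follows, using an in-memory buffer instead of `tweet_json.txt`; the helper names and the sample records are made up for illustration and are not real API responses.

```python
import io
import json


def write_records(records, handle):
    """Write one JSON document per line (newline-delimited JSON)."""
    for record in records:
        json.dump(record, handle)
        handle.write('\n')  # the separator that keeps tweets apart


def read_records(handle):
    """Parse each non-empty line back into a dict."""
    return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records carrying a few of the keys used in this notebook.
sample = [
    {'id_str': '1', 'full_text': 'hypothetical tweet A', 'retweet_count': 3},
    {'id_str': '2', 'full_text': 'hypothetical tweet B', 'retweet_count': 5},
]

buffer = io.StringIO()
write_records(sample, buffer)
buffer.seek(0)
restored = read_records(buffer)
```

Without the separator, all tweets would be concatenated on a single line and `json.loads` could not parse them individually.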


In [53]:
# Parsing the JSON data retrieved above

# Avoid duplicating work if data already parsed and saved locally
if 'tweet_parsed_data.csv' in os.listdir():
    print('Data already parsed and saved.')
    df_tweet = pd.read_csv('tweet_parsed_data.csv')

else:
    with open('tweet_json.txt', 'r') as file:
        tweet_jsons = file.readlines()

    # Iterating over each individual tweet
    tweet_data = list()

    for tweet in tweet_jsons:
        data = dict()

        js = json.loads(tweet) # Reading each tweet string as JSON
        
        # Skip text starting with "RT" or "@"
        if js['full_text'].startswith(('RT', '@')):
            continue

        # print(json.dumps(js, indent=4)) # Pretty printing JSON
        data['id'] = js['id_str']
        data['created'] = js['created_at']
        data['retweet'] = js['retweet_count']
        data['favorite'] = js['favorite_count']
        data['text'] = js['full_text']

        tweet_data.append(data)

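    # Reading data into Dataframe
    # (explicit column list: `data` is an empty dict whenever the last tweet was skipped)
    df_tweet = pd.DataFrame(tweet_data, columns=['id', 'created', 'retweet', 'favorite', 'text'])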

    # Extract dog's name
    df_tweet['name'] = df_tweet.text.str.extract(r' ([A-Z][a-z]*)\.')  # raw string avoids the invalid '\.' escape

    # Extract rating
    df_tweet['rate'] = df_tweet.text.str.extract(r'([0-9]+)/[0-9]{2}')  # '+' requires at least one numerator digit

    # Transforming df_tweet.created as Datetime
    df_tweet.created = pd.to_datetime(df_tweet.created)

    # Extract datetime from Created feature
    df_tweet['year'] = df_tweet.created.dt.year
    df_tweet['month'] = df_tweet.created.dt.month
    df_tweet['weekday'] = df_tweet.created.dt.day_name()
    df_tweet['hour'] = df_tweet.created.dt.hour

    # Saving it locally
    df_tweet.to_csv('tweet_parsed_data.csv', index=False)

df_tweet.head(2)

|   | id | created | retweet | favorite | text | name | rate | year | month | weekday | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | 2017-08-01 16:23:56+00:00 | 8197 | 37569 | This is Phineas. He's a mystical boy. Only eve... | Phineas | 13 | 2017 | 8 | Tuesday | 16 |
| 1 | 892177421306343426 | 2017-08-01 00:17:27+00:00 | 6060 | 32304 | This is Tilly. She's just checking pup on you.... | Tilly | 13 | 2017 | 8 | Tuesday | 0 |
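
To see what name and rating patterns of this kind actually capture, it helps to try them on a few strings first. The sketch below uses the standard-library `re` module, which (as far as I know) applies the same first-match logic as `Series.str.extract`. The tweet texts are invented for illustration, and the two rating patterns differ only in the quantifier on the numerator (`*` versus `+`).

```python
import re

NAME = re.compile(r' ([A-Z][a-z]*)\.')         # capitalized word followed by a period
RATE_STAR = re.compile(r'([0-9]*)/[0-9]{2}')   # numerator may be empty
RATE_PLUS = re.compile(r'([0-9]+)/[0-9]{2}')   # numerator needs at least one digit


def first_group(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Made-up tweet texts, for illustration only.
plain = 'This is Bella. She is a very good girl. 13/10'
unnamed = 'Here we have a sleepy pupper. 11/10'
with_link = 'This is Rex. https://t.co/42abcdef 12/10'
texts = (plain, unnamed, with_link)

names = [first_group(NAME, t) for t in texts]
star_rates = [first_group(RATE_STAR, t) for t in texts]
plus_rates = [first_group(RATE_PLUS, t) for t in texts]
```

Here `names` comes out as `['Bella', None, 'Rex']`. With the `*` quantifier the third text yields an empty string, because `/42` inside the link already satisfies the pattern; with `+` the match moves on to `12/10`. A tweet with no name simply produces a missing value, which is worth checking for during the assessment step.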
