## Wrangle and Analyze Data

Real-world data rarely comes clean. Using Python and its libraries, we will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it.


## Table of Contents
- [Introduction](#intro)
- [Part I - Gathering Data](#gather)
- [Part II - Assessing Data](#assess)
- [Part III - Cleaning Data](#clean)


<a id='intro'></a>
### Introduction

For this project I will be wrangling WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive contains only very basic tweet information. Additional gathering, followed by assessing and cleaning, is required for "Wow!"-worthy analyses and visualizations.

WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

<a id='gather'></a>
### Part I - Gathering Data

In this section I will gather data using 3 methodologies:
1. *The WeRateDogs Twitter archive*. __Downloaded manually__ from the server. 

2. *The tweet image predictions*, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be __downloaded programmatically__ using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

3. Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, __query the Twitter API__ for each tweet's JSON data __using Python's Tweepy library__ and store each tweet's entire set of JSON data in a file called *tweet_json.txt*.


Let's start by importing the necessary libraries.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import requests
import os
import tweepy
import json

In [2]:
# Read WeRateDogs Twitter archive csv file
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
# First check if the data is properly loaded
twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


#### Download the tweet image predictions TSV file from Udacity's server

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)

with open(url.split('/')[-1], 'wb') as f:
    f.write(r.content)
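
The download above fetches the file on every run. As a minimal sketch, a guard could skip the request when the file is already on disk. The helper name `download_if_missing` and its `fetch` parameter are hypothetical and introduced only for illustration; `fetch` is kept abstract so the sketch does not depend on the network.

```python
import os


def download_if_missing(url, fetch, folder="."):
    """Save the file behind `url` into `folder` unless it already exists.

    `fetch` is any callable that takes a URL and returns the body as bytes
    (hypothetical parameter, assumed here for illustration).
    Returns the local path of the file.
    """
    # Same naming rule as above: the last URL segment is the file name
    path = os.path.join(folder, url.split('/')[-1])
    if not os.path.exists(path):
        with open(path, 'wb') as f:
            f.write(fetch(url))
    return path
```

In this notebook it could be called as `download_if_missing(url, lambda u: requests.get(u).content)`, which reuses the Requests call shown above.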

In [4]:
# Read the tweet image predictions tsv file
twitter_images = pd.read_csv('image-predictions.tsv', sep = '\t')

In [5]:
# First check if the data is properly loaded
twitter_images.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


#### Twitter authentication setup

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
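
The four credentials are left blank above so that they are not published with the notebook. One possible way to fill them in without hard-coding is to read them from environment variables. This is only a sketch: the function name `load_twitter_credentials` and the variable names are assumptions, not part of Tweepy or of the original project.

```python
import os


def load_twitter_credentials(env=os.environ):
    """Read the four OAuth values from environment variables.

    The variable names below are hypothetical; any naming scheme works
    as long as the values are never committed with the notebook.
    """
    names = {
        'consumer_key': 'TWITTER_CONSUMER_KEY',
        'consumer_secret': 'TWITTER_CONSUMER_SECRET',
        'access_token': 'TWITTER_ACCESS_TOKEN',
        'access_secret': 'TWITTER_ACCESS_SECRET',
    }
    # Fail early with a clear message if any value is absent or empty
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary has the same four keys as the variables used in the setup above, so its values can be passed straight to `tweepy.OAuthHandler` and `set_access_token`.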

#### Download Tweets part 1

# List for storing JSON outputs
tweet_list = []
# List for storing Twitter IDs where the status has not been found
tweet_errors = []

# Loop through Twitter IDs from the twitter archive file
# Append JSON outputs to tweet_list and not found Twitter IDs to tweet_errors
for tweet_id in twitter_archive.tweet_id:
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended')
        tweet_list.append(tweet._json)
    except Exception as e:
        print(str(tweet_id) + " _ " + str(e))
        tweet_errors.append(tweet_id)

#### Download Tweets part 2

# Additional list for storing Twitter IDs where the status has not been found the second time
missing_list = []

# Loop through the list of missing IDs
# Append JSON outputs to tweet_list and not found Twitter IDs to missing_list
for missing_id in tweet_errors:
    try:
        # Query the API again for the ID that failed in part 1
        tweet = api.get_status(missing_id, tweet_mode='extended')
        tweet_list.append(tweet._json)
    except Exception as ex:
        print(str(missing_id) + " _ " + str(ex))
        missing_list.append(missing_id)
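
The two download passes above amount to a manual retry. As a rough sketch, the same idea can be written once as a loop over a configurable number of passes. The function `fetch_with_retries` and its `fetch` parameter are hypothetical names used for illustration; `fetch` stands in for whatever call returns one tweet's JSON or raises an exception.

```python
def fetch_with_retries(ids, fetch, passes=2):
    """Try to fetch every ID, retrying failures for up to `passes` rounds.

    `fetch` is a hypothetical callable that returns a dict for one ID
    or raises an exception. Returns (results, still_missing).
    """
    results = []
    pending = list(ids)
    for _ in range(passes):
        failed = []
        for item_id in pending:
            try:
                results.append(fetch(item_id))
            except Exception:
                failed.append(item_id)
        # Only the IDs that failed in this round are tried again
        pending = failed
        if not pending:
            break
    return results, pending
```

With the objects defined in this notebook, a call might look like `fetch_with_retries(twitter_archive.tweet_id, lambda i: api.get_status(i, tweet_mode='extended')._json)`.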

#### Create and save DataFrame

# Create a DataFrame from the list of dictionaries
json_df = pd.DataFrame(tweet_list)

# Save the DataFrame to a file
json_df.to_csv('tweet_json.txt', index=False)
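
The cell above writes the DataFrame as CSV under the `tweet_json.txt` name. Since Part I describes storing each tweet's entire set of JSON data, one possible alternative is to write one JSON object per line and read back only the fields of interest. This layout and the helper names `write_json_lines` and `read_counts` are assumptions made for illustration, not the method used in this notebook.

```python
import json


def write_json_lines(records, path):
    """Write each dict in `records` as one JSON document per line."""
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')


def read_counts(path):
    """Read the file back, keeping only the ID and the two counts per tweet."""
    rows = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            tweet = json.loads(line)
            rows.append({
                'tweet_id': tweet['id'],
                'retweet_count': tweet['retweet_count'],
                'favorite_count': tweet['favorite_count'],
            })
    return rows
```

Under these assumptions, `write_json_lines(tweet_list, 'tweet_json.txt')` would replace the `to_csv` call, and `pd.DataFrame(read_counts('tweet_json.txt'))` would give a compact three-column DataFrame.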

In [6]:
# Read tweet_json.txt file
twitter_json = pd.read_csv('tweet_json.txt', encoding = 'utf-8')

In [7]:
# First check if the data is properly loaded
twitter_json.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,Tue Aug 01 16:23:56 +0000 2017,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",38126,False,This is Phineas. He's a mystical boy. Only eve...,,...,,,,,8339,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,,,Tue Aug 01 00:17:27 +0000 2017,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32703,False,This is Tilly. She's just checking pup on you....,,...,,,,,6162,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,,,Mon Jul 31 00:18:03 +0000 2017,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24623,False,This is Archie. He is a rare Norwegian Pouncin...,,...,,,,,4078,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,,,Sun Jul 30 15:58:51 +0000 2017,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",41473,False,This is Darla. She commenced a snooze mid meal...,,...,,,,,8484,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,,,Sat Jul 29 16:00:24 +0000 2017,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",39639,False,This is Franklin. He would like you to stop ca...,,...,,,,,9168,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


<a id='assess'></a>
### Part II - Assessing Data

After gathering each of the above pieces of data, our task is to assess them visually and programmatically for quality and tidiness issues. We should detect and document at least eight (8) quality issues and two (2) tidiness issues in the wrangle_act.ipynb Jupyter Notebook.

#### Step 1: Detect

**a. Visual Assessment**

In [None]:
# Visual assessment of the 1st dataframe: twitter-archive-enhanced.csv
twitter_archive

In [None]:
# Visual assessment of the 2nd dataframe: image-predictions.tsv
twitter_images

In [None]:
# Visual assessment of the 3rd dataframe: tweet_json.txt
twitter_json

**b. Programmatic Assessment**

In [None]:
# View 10 random records from the twitter_archive dataframe
twitter_archive.sample(10)

In [None]:
# View info of twitter_archive dataframe
twitter_archive.info()

In [None]:
# View descriptive statistics of twitter_archive dataframe
twitter_archive.describe()

In [None]:
# View 10 random records from the twitter_images dataframe
twitter_images.sample(10)

In [None]:
# View info of twitter_images dataframe
twitter_images.info()

In [None]:
# View descriptive statistics of twitter_images dataframe
twitter_images.describe()

In [None]:
# View 10 random records from the twitter_json dataframe
twitter_json.sample(10)

In [None]:
# View info of twitter_json dataframe
twitter_json.info()

In [None]:
# View descriptive statistics of twitter_json dataframe
twitter_json.describe()
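
Beyond `sample`, `info` and `describe`, a few targeted counts can help turn the programmatic assessment into concrete findings. The following is a small sketch on a made-up three-row DataFrame; the helper `quick_checks` and the toy values are hypothetical and do not come from the real archive.

```python
import pandas as pd


def quick_checks(df, id_col='tweet_id'):
    """Return a few counts that often point at quality issues."""
    return {
        'rows': len(df),
        'duplicate_ids': int(df[id_col].duplicated().sum()),
        'missing_by_column': df.isnull().sum().to_dict(),
    }


# Tiny made-up frame, for illustration only (not taken from the real archive)
toy = pd.DataFrame({
    'tweet_id': [1, 2, 2],
    'rating_denominator': [10, 10, 7],
    'name': ['Rex', None, 'a'],
})

report = quick_checks(toy)

# Rows whose denominator differs from the usual 10 deserve a closer look
odd_denominators = toy[toy.rating_denominator != 10]
```

The same calls could be pointed at `twitter_archive`, `twitter_images` and `twitter_json` to collect candidates for the quality and tidiness lists.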

#### Step 2: Document

**a. Quality**

**b. Tidiness**