## Key Points

1. ***You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.***

2. ***The requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset. Cleaning includes merging individual pieces of data according to the rules of tidy data.***
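
The first key point can be sketched on toy data as follows. This is only an illustration: the column names `tweet_id`, `retweeted_status_id`, and `jpg_url` are assumptions about the archive and image prediction tables, and the rows are made up.

```python
import pandas as pd

# Toy stand-ins for the tweet archive and the image predictions.
# Column names here are assumptions, not confirmed by the files above.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 99.0, None, None],
    "text": ["13/10 good boy", "RT @dog_rates: 12/10", "11/10 pupper", "10/10 floof"],
})
images = pd.DataFrame({
    "tweet_id": [1, 2, 4],
    "jpg_url": ["a.jpg", "b.jpg", "d.jpg"],
})

# Keep original tweets only: a retweet carries a non-null retweeted_status_id.
originals = archive[archive["retweeted_status_id"].isna()]

# Keep only tweets that have an image: inner join against the image table.
rated_with_images = originals.merge(images, on="tweet_id", how="inner")

print(rated_with_images[["tweet_id", "jpg_url"]])
```

An inner join drops both the retweet (already filtered out) and the original tweet that has no matching image row.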


In [1]:
import requests
import pandas as pd
import os
import re
import matplotlib.pyplot as plt
import json
import datetime
import numpy as np
%matplotlib

Using matplotlib backend: TkAgg


## 1. Gather
- Image Prediction Data

In [2]:
# Use the requests library to download the image prediction TSV file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
with open('image-predictions.tsv', mode='wb') as file:
    file.write(r.content)

In [2]:
# reading the image prediction file
image_prediction = pd.read_csv('image-predictions.tsv', sep = '\t')

In [None]:
image_prediction.sample(10)

In [None]:
image_prediction.info()

In [None]:
tweet_archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv' )

In [None]:
tweet_archive_enhanced.head(10)

In [None]:
tweet_archive_enhanced.info()

In [None]:
# tweet_json.txt holds one JSON object per line, so read it as JSON lines
tweet_data = pd.read_json('tweet_json.txt', lines=True)

In [None]:
# Read tweet_json.txt line by line and keep only the fields we need
tweets_id = []
fav_count = []
retweets_count = []
with open('tweet_json.txt', mode='r') as file:
    for line in file:
        tweet = json.loads(line)  # each line is one tweet's JSON
        tweets_id.append(tweet['id'])
        fav_count.append(tweet['favorite_count'])
        retweets_count.append(tweet['retweet_count'])

additional_tweet_data = pd.DataFrame({'id': tweets_id, 'favorite_count': fav_count, 'retweet_count': retweets_count})

In [None]:
additional_tweet_data

### Assessing Data for this Project
- #### After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least 'eight (8)' quality issues and 'two (2)' tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.
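
A minimal sketch of what a programmatic assessment might look like, run on a made-up sample. The column names (`tweet_id`, `name`, `rating_denominator`) and the checks chosen are assumptions for illustration, not a list of the issues actually present in the data.

```python
import pandas as pd

# Made-up sample with a few deliberate problems to detect.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "name": ["Bella", "a", "a", None],
    "rating_denominator": [10, 10, 10, 7],
})

# Quality check: missing values per column.
missing_per_column = sample.isna().sum()

# Quality check: repeated tweet IDs.
duplicate_ids = sample["tweet_id"].duplicated().sum()

# Quality check: ratings whose denominator is not the usual 10.
odd_denominators = sample[sample["rating_denominator"] != 10]

# Quality check: all-lowercase "names" are often ordinary words, not dog names.
lowercase_names = sample["name"].dropna().loc[lambda s: s.str.islower()].unique()

print(missing_per_column)
print(duplicate_ids, len(odd_denominators), list(lowercase_names))
```

Each check maps to one documented issue, which makes it easy to rerun the same test after cleaning to confirm the fix.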

### Cleaning Data for this Project
- #### Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.

### Storing, Analyzing, and Visualizing Data for this Project
- #### Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).

- #### Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.
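
The storage step described above could be sketched like this. The file name `twitter_archive_master.csv` comes from the project text; the database file name, the table name `master`, the temporary output folder, and the toy rows are assumptions made for the example.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Toy stand-in for the cleaned master DataFrame (values are made up).
master = pd.DataFrame({
    "tweet_id": [1, 4],
    "rating_numerator": [13, 10],
    "favorite_count": [100, 50],
})

# Write into a temporary folder so the example leaves the project folder untouched.
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
db_path = os.path.join(out_dir, "twitter_archive_master.db")  # hypothetical name

# Store as CSV, without the index column.
master.to_csv(csv_path, index=False)

# Optionally store the same table in a SQLite database.
conn = sqlite3.connect(db_path)
master.to_sql("master", conn, if_exists="replace", index=False)
conn.close()

# Read both copies back to confirm the round trip.
restored = pd.read_csv(csv_path)
conn = sqlite3.connect(db_path)
from_db = pd.read_sql("SELECT * FROM master", conn)
conn.close()

print(restored.shape, from_db.shape)
```

Passing `index=False` keeps the default integer index out of both files, so reloading does not produce an extra unnamed column.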

### Reporting for this Project
- #### Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

- #### Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

- #### Both of these documents can be created in separate Jupyter Notebooks using the Markdown functionality of Jupyter Notebooks, then downloading those notebooks as PDF files or HTML files (see image below). You might prefer to use a word processor like Google Docs or Microsoft Word, however.