# Report - Wrangle WeRateDogs Twitter Data 

The aim of this project was to gather data from the Twitter account WeRateDogs from multiple sources, clean this data, and analyze it. 

The tasks carried out in this project are as follows: <br>
* Data wrangling, consisting of the following steps:
    * Gathering data
    * Assessing data
    * Cleaning data
* Storing, analyzing, and visualizing the wrangled data
* Reporting on: <br>
    * the data wrangling efforts 
    * data analysis and visualization


## Gathering Data
Data was gathered from 3 different sources:

1. The WeRateDogs Twitter archive file 'twitter_archive_enhanced.csv' was provided in CSV format for download. 
<br>
2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and was downloaded programmatically using the Requests library and the following URL:
https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
<br>
3. Each tweet's JSON data, obtained by querying the Twitter API with Python's Tweepy library. Each tweet's entire set of JSON data was stored in a file called tweet_json.txt.
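
The second and third sources can be sketched as follows. This is a minimal illustration rather than the project's exact code: the helper names `download_file` and `load_tweet_json` are hypothetical, and the sketch assumes that tweet_json.txt holds one JSON object per line and that only the `id`, `retweet_count`, and `favorite_count` fields are needed.

```python
import json
import os

import pandas as pd


def download_file(url, folder="."):
    """Download url into folder and return the local file path."""
    import requests  # third-party; imported here so the parser below works without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    path = os.path.join(folder, url.split("/")[-1])
    with open(path, "wb") as f:
        f.write(response.content)
    return path


def load_tweet_json(path):
    """Read a file with one tweet JSON object per line into a dataframe."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet.get("retweet_count"),
                "favorite_count": tweet.get("favorite_count"),
            })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])
```

Under these assumptions, `pd.read_csv(download_file(url), sep='\t')` would load the image predictions, and `load_tweet_json('tweet_json.txt')` would load the API data.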

## Assessing Data

After gathering the relevant data, the next step was to carry out programmatic and visual assessment. This assessment showed quality and tidiness issues, described below:

### Quality issues
1. The dataset contains retweets, which should be removed so that only original tweets remain.

2. Timestamp is in string format and should be in datetime format.

3. tweet_id is an integer, but should be a string.

4. Many dogs do not have names and there is missing data in the names column.

5. The dog stages columns doggo, floofer, pupper, and puppo contain many null or missing values.

6. img_num column should be in string format.

7. The columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp have too many missing values and should be removed.

8. The dog types in columns p1, p2, and p3 use inconsistent capitalization (a mix of uppercase and lowercase letters).

### Tidiness issues
1. The four columns doggo, floofer, pupper, and puppo represent a single variable and should be combined into one categorical column, dog_type.
2. The twitter_archive, image_predictions, and tweet_json dataframes should be merged into a single dataframe.
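
A programmatic assessment of this kind can be sketched as below. The helper `assess` and the toy values are hypothetical and only illustrate the checks (data types, missing values, duplicate IDs); they are not the project's actual code or data.

```python
import pandas as pd


def assess(df, id_col="tweet_id"):
    """Return a small dictionary of programmatic assessment results."""
    return {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isnull().sum().to_dict(),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
    }


# Toy data standing in for the archive (values are illustrative only)
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2],
    "timestamp": ["2017-08-01 16:23:56 +0000", None, "2017-07-31 00:18:03 +0000"],
    "name": ["Rex", "None", "a"],
})
report = assess(sample)
print(report)
```

Here the integer `tweet_id`, the missing timestamp, and the duplicated ID each surface in the report, which mirrors how issues in the list above can be detected.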

### Cleaning Data

#### Issue #1:
**Define:** <br>
Merge the json, twitter, and image dataframes into a single dataframe.

**Code** <br>
```df2 = pd.concat([twitter, image, json], join='outer', axis=1)``` <br>

**Test** <br>
```df2.head()``` <br>
```df2.columns``` <br>

The test showed that the 3 dataframes were merged.
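
Note that `pd.concat(..., axis=1)` aligns rows by index position, not by tweet ID. If the three dataframes differ in length or row order, a key-based merge on `tweet_id` may be the safer choice. The sketch below uses toy frames with hypothetical values (the name `json_df` and the columns `rating_numerator` and `favorite_count` are assumptions for illustration) to contrast the two approaches.

```python
import pandas as pd

# Toy frames with hypothetical values; note the different row order and lengths
twitter = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12, 13, 11]})
image = pd.DataFrame({"tweet_id": [3, 1], "p1": ["pug", "beagle"]})
json_df = pd.DataFrame({"tweet_id": [2, 1, 3], "favorite_count": [20, 10, 30]})

# Positional concat: row 0 pairs tweet 1 with the prediction for tweet 3,
# and the result carries two tweet_id columns
by_position = pd.concat([twitter, image], axis=1)

# Key-based merge: rows are matched on tweet_id, regardless of position
merged = (
    twitter
    .merge(image, on="tweet_id", how="inner")
    .merge(json_df, on="tweet_id", how="inner")
)
print(merged)
```

With `how="inner"`, only tweets present in all three sources are kept; `how="left"` would keep every archive row instead.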

#### Issue #2:
**Define:** <br>
Remove retweets from the dataset and retain original tweets only. <br>

**Code** <br>
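```python
# Keep original tweets only: retweets have a non-null retweeted_status_id
df2 = df2[df2['retweeted_status_id'].isnull()]
df2 = df2.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id',
                'retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1)
```

**Test** <br>
```df2.head()``` <br>
```df2.columns``` <br>

The test showed that the reply and retweet columns were dropped; only original tweets remain after the row filter.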
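
Dropping the retweet columns does not by itself remove retweet rows; the rows have to be filtered while `retweeted_status_id` is still available. A minimal sketch on toy data (all values hypothetical):

```python
import numpy as np
import pandas as pd

# Toy archive: the second row is a retweet (non-null retweeted_status_id)
archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweeted_status_id": [np.nan, 111.0, np.nan],
    "text": ["original tweet", "a retweet", "original tweet"],
})

# Step 1: keep only rows that are not retweets
originals = archive[archive["retweeted_status_id"].isnull()]

# Step 2: the column is now entirely null and can be dropped
originals = originals.drop(columns=["retweeted_status_id"])
print(originals)
```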

#### Issue #3:
**Define:** <br>
Change img_num to string datatype. <br>

**Code** <br>
```df2['img_num'] = df2['img_num'].astype(str)``` <br>

**Test** <br>
```type(df2['img_num'].iloc[0])``` <br>

The test showed that the datatype of img_num was changed to string.

#### Issue #4:
**Define:** <br>
Change tweet_id from integer to string. <br>

**Code** <br>
```df2['tweet_id'] = df2['tweet_id'].astype(str) ``` <br>

**Test** <br>
```type(df2.iloc[0,0]) ``` <br>

The test showed that the datatype of tweet_id was changed from integer to string.
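
Checking a single element confirms only that one value. A sketch that checks every value in both converted columns, and that illustrates one reason to store IDs as strings, is shown below. The 18-digit IDs are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "tweet_id": [100000000000000001, 100000000000000002],
    "img_num": [1, 2],
})

for col in ["tweet_id", "img_num"]:
    df[col] = df[col].astype(str)

# Verify that every value in each column is a Python string
all_strings = {
    col: all(isinstance(value, str) for value in df[col])
    for col in ["tweet_id", "img_num"]
}

# Why strings: an 18-digit ID does not survive a round trip through float,
# which can happen silently when an integer column picks up missing values
lossy = int(float(100000000000000001))
print(all_strings, lossy)
```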

#### Issue #5:
**Define:** <br>
Change timestamp from string to datetime. <br>

**Code** <br>
```df2['timestamp'] = pd.to_datetime(df2['timestamp'])``` <br>

**Test** <br>
```type(df2['timestamp'][0]) ``` <br>

The test confirmed that the timestamp column was in datetime format.
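
A column-level check can complement the single-element test. The sketch below assumes timestamps of the form `YYYY-MM-DD HH:MM:SS +0000`; the sample values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
})

# utc=True yields a timezone-aware datetime column
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

# Check the dtype of the whole column rather than one element
is_datetime = pd.api.types.is_datetime64_any_dtype(df["timestamp"])
years = df["timestamp"].dt.year.tolist()
print(is_datetime, years)
```

Once the column is a datetime, the `.dt` accessor makes components such as year or hour available for later analysis.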

#### Issue #6:
**Define:** <br>
Replace invalid entries in the name column (words such as 'a', 'an', and 'the' that appear in place of a name) with the placeholder 'None', which already marks missing names. <br>

**Code** <br>
```df2['name'] = df2['name'].replace(['a', 'an', 'very', 'the', 'not', 'quite', 'actually', 'by'], 'None')``` <br>

**Test** <br>
```df2['name'].value_counts() ``` <br>

The test confirmed that the incomplete values in the names column were replaced with 'None'.
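
An alternative, shown here only as a sketch, is to use true nulls instead of the string 'None', so that pandas methods such as `isnull()` and `dropna()` recognize the missing names. The sketch assumes that valid names are capitalized, which holds for every word in the replacement list above but is not guaranteed for the full dataset. The sample names are made up.

```python
import numpy as np
import pandas as pd

names = pd.Series(["Rex", "a", "None", "the", "Bella", "quite"])

# Treat all-lowercase words and the literal string 'None' as invalid names
invalid = names.str.islower() | names.eq("None")

# Replace invalid entries with NaN
cleaned = names.mask(invalid, np.nan)
print(cleaned)
```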

#### Issue #7:
**Define:** <br>
Merge the dog stage columns doggo, floofer, pupper, and puppo into a single column. Some tweets contain multiple dog stages, and these also need to be cleaned. <br>

**Code** <br>
```python
# Concatenate the four stage columns, e.g. 'doggoNoneNoneNone'
df2['dog_type'] = df2['doggo'] + df2['floofer'] + df2['pupper'] + df2['puppo']
df2['dog_type'].value_counts()

# Tweets that mention two stages are labeled 'multiple'
df2.loc[df2.dog_type == 'doggoNonepupperNone', 'dog_type'] = 'multiple'
df2.loc[df2.dog_type == 'doggoNoneNonepuppo', 'dog_type'] = 'multiple'
df2.loc[df2.dog_type == 'doggoflooferNoneNone', 'dog_type'] = 'multiple'

# Keep only the stage name; rows with no stage become NaN
df2['dog_type'] = df2['dog_type'].str.extract('(doggo|floofer|pupper|puppo|multiple)', expand=False)

df2.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis=1, inplace=True)
```

<br>

**Test** <br>
```df2['dog_type'].value_counts()``` <br>
```df2.columns ``` <br>

The test confirmed that the individual dog stages were merged into a single column, dog_type. Tweets that contained more than one dog stage now have the dog_type value 'multiple'.
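
The same result can be reached without enumerating each concatenated string. The sketch below is an alternative under the assumption that a missing stage is stored as the string 'None' in each column; the helper `combine_stages` and the toy rows are hypothetical.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Return the single stage in a row, 'multiple' for several, None for none."""
    found = [stage for stage in STAGES if row[stage] == stage]
    if not found:
        return None
    return found[0] if len(found) == 1 else "multiple"


df = pd.DataFrame({
    "doggo":   ["doggo", "None", "doggo", "None"],
    "floofer": ["None", "None", "None", "None"],
    "pupper":  ["None", "pupper", "pupper", "None"],
    "puppo":   ["None", "None", "None", "None"],
})
df["dog_type"] = df.apply(combine_stages, axis=1)
df = df.drop(columns=STAGES)
print(df)
```

This version handles any combination of stages, including ones not listed explicitly, at the cost of a row-wise `apply`.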

#### Issue #8:
**Define:** <br>
Change the types of dogs in columns p1, p2, and p3 to lowercase letters. <br>

**Code** <br>
```python
df2['p1'] = df2['p1'].str.lower()
df2['p2'] = df2['p2'].str.lower()
df2['p3'] = df2['p3'].str.lower()
```

**Test** <br>
```python
df2['p1'].unique()
df2['p2'].unique()
df2['p3'].unique()
```

The test confirmed that the values in the columns p1, p2, and p3 were converted to lowercase.
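
Inspecting `unique()` output by eye works for small columns; a programmatic check scales better. A minimal sketch with made-up prediction values:

```python
import pandas as pd

df = pd.DataFrame({
    "p1": ["Golden_Retriever", "pug"],
    "p2": ["Labrador_retriever", "Chihuahua"],
    "p3": ["kuvasz", "Beagle"],
})
pred_cols = ["p1", "p2", "p3"]

# Lowercase all three prediction columns in one step
df[pred_cols] = df[pred_cols].apply(lambda col: col.str.lower())

# A value is already lowercase if lowercasing it again changes nothing
all_lower = all((df[col] == df[col].str.lower()).all() for col in pred_cols)
print(all_lower)
```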

### Storing Data

Storing the cleaned data into a CSV file: <br>
```df2.to_csv('twitter_archive_master.csv', encoding='utf-8', index=False)```
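
One caveat when reading the stored file back: a CSV does not record data types, so `tweet_id` is inferred as an integer again unless a dtype is supplied. The sketch below writes a toy dataframe (made-up values) to a temporary folder to show the difference.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "tweet_id": ["100000000000000001", "100000000000000002"],
    "dog_type": ["pupper", None],
})

path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
df.to_csv(path, encoding="utf-8", index=False)

# Default read: tweet_id is inferred as an integer column
default_read = pd.read_csv(path)

# Explicit dtype: tweet_id stays a string
typed_read = pd.read_csv(path, dtype={"tweet_id": str})
print(default_read.dtypes, typed_read.dtypes, sep="\n")
```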

