# Data Wrangling

**Brief** 

WeRateDogs is a popular Twitter account that rates people's dogs on a scale of 10.
It is known for its unique rating system, in which dogs are regularly given a numerator greater than the denominator.

The task is to wrangle WeRateDogs's Twitter archive.

**Process**  

The whole wrangling process was carried out in four parts:

*Gathering*
             
  This stage gathers data from three sources: (1) a file on hand, (2) a remote server, and (3) the Twitter API.

*Assessing*

  Assessing the gathered data for quality and tidiness issues using both visual and programmatic assessment.
  
*Cleaning*  

  Cleaning the data for the issues identified using the define, code, test framework.
  
*Storing*  

  Storing the cleaned and tidied data into a CSV file and a SQLite database.

## Gathering Data

**Brief**

All the files gathered for the wrangling are stored in a folder, **extracted_files**, which is a sibling directory to the file wrangle_act.ipynb.

### Gathering twitter_archive_enhanced.csv

* The file contains information about WeRateDogs's tweets, such as tweet_id, timestamp, and text.
* The text column of the file was used to extract further information such as the dog's name, stage, and rating.
* twitter_archive_enhanced.csv was downloaded directly from the workspace and is treated as the file on hand.

### Gathering image_predictions.tsv

* This file contains the predictions of a neural network that was run on the images in WeRateDogs's tweets.
* Since the file was stored on a remote server, it was downloaded using the requests library and stored in extracted_files.
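The download step could look roughly like the sketch below. The function name `download_file`, the `session` parameter, and the 30-second timeout are assumptions made for illustration; the source does not give the server URL, so none is filled in here.

```python
from pathlib import Path

import requests


def download_file(url, folder="extracted_files", session=None):
    """Download `url` and save it in `folder` under its original file name."""
    # A requests.Session (or a stand-in with a .get method) can be injected;
    # otherwise the module-level requests.get is used.
    http = session or requests
    response = http.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page

    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Use the last path segment of the URL as the file name.
    target = target_dir / url.rstrip("/").split("/")[-1]
    target.write_bytes(response.content)
    return target
```

Calling `download_file(url)` with the address of image_predictions.tsv would then leave the file in extracted_files and return its path.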

### Gathering retweet_count and favorite_count from the Twitter API

* Python's tweepy library was used to query the Twitter API.

* Queried the Twitter API by looping through the tweet_ids in twitter_archive_enhanced and stored the JSON returned by the API in a text file called tweet_json.txt using the json.dumps method, taking care to append a newline at the end of each iteration.

* Handled the failed extractions by storing the error message and the corresponding tweet_id in a list called failed extractions. (These tweets had already been deleted by the time the API was queried.)

* Read the JSON stored in tweet_json.txt using json.loads and extracted the attributes retweet_count and favorite_count from the resulting dictionary.

* Looped through the failed extractions list to extract the tweet_ids of the failed extractions and stored them in a dataframe called deleted_tweets, adding the columns retweet_count and favorite_count, both set to np.nan.

* Stored the two attributes, along with the corresponding tweet_id, in a dataframe called tweet_attrs.
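The steps above can be sketched as follows. To keep the example runnable without API credentials, the API call is passed in as a callable, `fetch_status`, which is a hypothetical stand-in: with tweepy it would presumably wrap something like `api.get_status(tweet_id)._json`. The function names and the `failed_extractions` variable name are likewise assumptions, not the notebook's actual code.

```python
import json

import numpy as np
import pandas as pd


def fetch_all(tweet_ids, fetch_status, json_path="tweet_json.txt"):
    """Write one JSON document per line; return (tweet_id, error message) pairs for failures."""
    failed_extractions = []
    with open(json_path, "w", encoding="utf-8") as f:
        for tweet_id in tweet_ids:
            try:
                status = fetch_status(tweet_id)  # expected to return a dict of tweet JSON
            except Exception as err:  # e.g. the tweet has been deleted
                failed_extractions.append((tweet_id, str(err)))
                continue
            f.write(json.dumps(status) + "\n")  # the newline keeps one tweet per line
    return failed_extractions


def read_tweet_attrs(json_path="tweet_json.txt"):
    """Parse the file line by line and keep only the two counts of interest."""
    rows = []
    with open(json_path, encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


def build_deleted_tweets(failed_extractions):
    """One row per failed tweet_id, with both counts set to NaN."""
    deleted = pd.DataFrame({"tweet_id": [tweet_id for tweet_id, _ in failed_extractions]})
    deleted["retweet_count"] = np.nan
    deleted["favorite_count"] = np.nan
    return deleted
```

Writing one JSON document per line means the file can be read back one tweet at a time, so a single malformed line does not prevent the rest from being parsed.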

## Assessing Data

**Brief** 

All the extracted pieces of data were checked for quality and tidiness issues.

*Visual assessment*

This stage mainly helped in identifying accuracy and consistency issues. 

*Programmatic assessment*

This stage helped in checking for validity and completeness issues.

### Visual assessment: primary check for accuracy and tidiness

Viewed each dataset individually in the Jupyter notebook display. In some cases, created specialized views to help identify certain issues.
Example: created a view displaying records where the rating numerator was less than the denominator to identify potential issues, since dogs are conventionally rated higher than 10.

Identified all accuracy and tidiness issues using the displays, and some issues with data validity.
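A specialized view of the kind described in the example might be built as in this sketch. The column names `rating_numerator` and `rating_denominator`, and the helper name `low_ratings`, are assumptions; the source only speaks of a rating numerator and denominator.

```python
import pandas as pd


def low_ratings(archive):
    """Return rows whose numerator is below the denominator, for manual inspection."""
    mask = archive["rating_numerator"] < archive["rating_denominator"]
    # Keep only the columns needed to judge whether the rating was parsed correctly.
    return archive.loc[mask, ["tweet_id", "text", "rating_numerator", "rating_denominator"]]
```

Showing the tweet text next to the extracted rating makes it easier to spot cases where the rating was extracted from the wrong part of the text.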

### Programmatic assessment: primary check for validity and completeness

Wrote Python scripts to detect missing values and to list the data type of each column, which helped in detecting missing values and incorrect data types.

*missing values*

  Used pandas' isnull() function, along with seaborn's heatmap function, to identify missing values.

*incorrect datatypes*

  Used pandas' dtypes attribute to list the data type of each column.
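Both checks can be combined into one small report, as in this minimal sketch. The helper name `assess` is hypothetical, and the heatmap is left out to avoid a plotting dependency; the boolean mask returned by `df.isnull()` is what such a heatmap would be drawn from.

```python
import pandas as pd


def assess(df):
    """Summarise missing values and column data types side by side."""
    return pd.DataFrame({
        "missing": df.isnull().sum(),     # number of null cells per column
        "dtype": df.dtypes.astype(str),   # dtypes is an attribute, not a method
    })
```

A column that should hold dates but is reported with a text dtype, or an ID column with a non-zero missing count, would stand out in this report.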

## Cleaning Data

**Brief**

The data cleaning sequence was carried out using the define, code, test framework.

The cleaning addressed two kinds of issues:

1. Quality: issues with the contents of the data.
2. Tidiness: issues with the structure of the data.

### Use of define, code, and test framework

*Define*

A short note on the approach used to solve the cleaning task. 

*Code* 

The cleaning code.

*Test*

A description of the test that would prove the successful completion of the cleaning task, followed by the code that conducts the test.
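One pass through the framework might look like the sketch below. The issue chosen here, a timestamp column stored as text, and the sample rows are hypothetical and serve only to illustrate the three steps.

```python
import pandas as pd

# Hypothetical sample standing in for the archive.
archive = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
})

# Define: timestamp is stored as text; convert it to a datetime column.

# Code
archive_clean = archive.copy()  # clean a copy so the raw data stays untouched
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"], utc=True)

# Test: the column should now have a datetime dtype.
assert pd.api.types.is_datetime64_any_dtype(archive_clean["timestamp"])
```

Keeping the test as an assertion right below the cleaning code means the notebook fails visibly if the fix ever stops working.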

## Storing 

**Brief**

The wrangled data was stored in two ways:

* A directory containing all the cleaned files
* A sqlite database.

**Creating twitter_archive_master**

* twitter_archive_master is a master dataframe that contains all the data gathered for the project.

* The dataset was created by joining all the dataframes, using tweet_id as the join key.

* The frames were ordered so that the table on the left had more rows than the one on the right. A left join was used, which preserves the key order of the left table, so all the tweet_ids were kept intact.
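The join described above can be sketched as a chain of left merges. The helper name `build_master` is hypothetical, and the sketch assumes tweet_id is unique within each frame; duplicate keys on the right would add rows to the result.

```python
import pandas as pd


def build_master(archive, *others):
    """Left-join each frame onto the archive using tweet_id as the key."""
    master = archive
    for frame in others:
        # how="left" keeps every row of the left frame, in its original order;
        # tweet_ids missing on the right get NaN in the new columns.
        master = master.merge(frame, on="tweet_id", how="left")
    return master
```

For instance, `build_master(twitter_archive_enhanced, tweet_attrs, dog_attrs)` would keep one row per archived tweet, with missing values wherever a tweet has no match in the other frames.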

### Storing data to CSV files

* Created a sub-directory, cleaned_dataset_files, to store all the cleaned datasets.

* The master dataframe and all datasets that were created for the purpose of tidiness are stored in a sub-directory, submission_files, that lies inside cleaned_dataset_files.


* The cleaned dataset directory is structured in the following way:

        cleaned_dataset_files/
            image_predictions.csv
            submission_files/
                dog_attrs.csv
                tweet_attrs.csv
                twitter_archive_enhanced.csv
                twitter_archive_master.csv
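Writing the frames into this layout could be done along the lines of the sketch below. The helper name `store_csvs` and the choice to drop the index are assumptions, not details taken from the notebook.

```python
from pathlib import Path

import pandas as pd


def store_csvs(frames, folder):
    """Write each dataframe in `frames` (name -> dataframe) to `<folder>/<name>.csv`."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)  # also creates missing parent folders
    paths = []
    for name, frame in frames.items():
        path = target_dir / f"{name}.csv"
        frame.to_csv(path, index=False)  # drop the index so a reload has the same columns
        paths.append(path)
    return paths
```

Passing `"cleaned_dataset_files/submission_files"` as the folder would create both directory levels in one call.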
                     

### Storing data to SQLite database

* Created a SQLite database called submissions using the create_engine function from the SQLAlchemy library. 
  
  
* Stored the following dataframes as tables in the SQLite database:

  * twitter_archive_master
  * twitter_archive_enhanced
  * dog_attrs
  * tweet_attrs

* The submissions database is stored in the folder "Tweet Data Wrangling".
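A minimal sketch of this step is shown below. The file name `submissions.db`, the helper name `store_tables`, and the use of `if_exists="replace"` are assumptions; the source only states that the database is called submissions and was created with create_engine.

```python
import pandas as pd
from sqlalchemy import create_engine


def store_tables(frames, db_url="sqlite:///submissions.db"):
    """Write each dataframe in `frames` (name -> dataframe) as a table; return the engine."""
    engine = create_engine(db_url)
    for name, frame in frames.items():
        # if_exists="replace" makes the step safe to re-run from the notebook.
        frame.to_sql(name, engine, index=False, if_exists="replace")
    return engine
```

A table can then be read back with `pd.read_sql_table("twitter_archive_master", engine)` to confirm that it was stored as expected.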