## Gather

In [1]:
import pandas as pd
import pickle

In [2]:
# WeRateDogs Twitter archive, downloaded manually  
twitter_archive_enhanced_df = pd.read_csv('twitter_archive_enhanced.csv')

In [3]:
twitter_archive_enhanced_clean_df = twitter_archive_enhanced_df.copy()

In [4]:
# make sure we can read it
print('twitter_archive_enhanced_clean_df.info()')
# info() prints its report itself and returns None, so it is not wrapped in print()
twitter_archive_enhanced_clean_df.info()

twitter_archive_enhanced_clean_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object

## Assess

Requirement compliance  
<br>
Source: Data Analyst Nanodegree Program, 8. Data Wrangling, Project: Wrangle and Analyze Data, 2. Project Motivation, Key Points  
Requirement: You only want original ratings (no retweets)

The retweeted_status_id column (a pandas Series) was selected to identify retweets because of  
* domain knowledge
  * RetweetedStatusId - The unique identifier for the original Tweet if this is a retweet, otherwise it is null.

Visual assessment: retweets are present in twitter_archive_enhanced_clean_df (181 non-null retweeted_status_id values in the info() output above).
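The visual assessment can be backed by a quick count. The following is a minimal sketch, assuming a small hypothetical frame `toy_df` that stands in for the archive (its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the archive: two original tweets and one retweet.
toy_df = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [np.nan, 10.0, np.nan],
})

# A row is a retweet when retweeted_status_id is not null.
is_retweet = toy_df['retweeted_status_id'].notnull()
retweet_count = int(is_retweet.sum())
print(retweet_count)
```

Run against the real DataFrame, the same count should agree with the non-null figure that info() reports for retweeted_status_id.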

## Clean

### Define

Remove the retweet observations (rows) to comply with the project requirement.

### Code

In [5]:
# Keep only the rows where retweeted_status_id is null (original tweets)
if 'retweeted_status_id' in twitter_archive_enhanced_clean_df.columns:
    twitter_archive_enhanced_clean_df = twitter_archive_enhanced_clean_df[
        twitter_archive_enhanced_clean_df['retweeted_status_id'].isnull()
    ]

### Test

In [6]:
twitter_archive_enhanced_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2175 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2175 non-null object
source                        2175 non-null object
text                          2175 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2117 non-null object
rating_numerator              2175 non-null int64
rating_denominator            2175 non-null int64
name                          2175 non-null object
doggo                         2175 non-null object
floofer                       2175 non-null object
pupper                        2175 non-null object
puppo                         2175 non-null object
dtypes: float64(4), int64(3), object(10)

Success  
retweet observations (rows) removed  
requirement compliance achieved  
row count checks out: 2356 - 2175 = 181 rows removed, matching the 181 non-null retweeted_status_id values in the original data
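The row arithmetic above can also be checked programmatically rather than by reading info() output. This is a sketch under the assumption of a hypothetical four-row frame `before_df`; the names and values are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the archive: two originals and two retweets.
before_df = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [np.nan, 10.0, np.nan, 11.0],
})

# Number of rows we expect the filter to remove.
expected_removed = int(before_df['retweeted_status_id'].notnull().sum())

# Same filter as in the Code step: keep rows with a null retweeted_status_id.
after_df = before_df[before_df['retweeted_status_id'].isnull()]

# No retweets remain, and the row difference matches the retweet count.
removed = len(before_df) - len(after_df)
assert after_df['retweeted_status_id'].isnull().all()
assert removed == expected_removed
print(removed)
```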

## Assess

The retweet columns (pandas Series) are no longer needed:  
* retweeted_status_id
* retweeted_status_user_id
* retweeted_status_timestamp

## Clean

### Define
* drop the retweet columns (pandas Series) that are no longer needed
  * retweeted_status_id
  * retweeted_status_user_id
  * retweeted_status_timestamp

### Code

In [7]:

# Drop the retweet columns; errors='ignore' skips any that are already gone,
# so the cell can be re-run safely
twitter_archive_enhanced_clean_df = twitter_archive_enhanced_clean_df.drop(
    ['retweeted_status_id',
     'retweeted_status_user_id',
     'retweeted_status_timestamp'],
    axis=1, errors='ignore')

### Test

In [8]:
twitter_archive_enhanced_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                 2175 non-null int64
in_reply_to_status_id    78 non-null float64
in_reply_to_user_id      78 non-null float64
timestamp                2175 non-null object
source                   2175 non-null object
text                     2175 non-null object
expanded_urls            2117 non-null object
rating_numerator         2175 non-null int64
rating_denominator       2175 non-null int64
name                     2175 non-null object
doggo                    2175 non-null object
floofer                  2175 non-null object
pupper                   2175 non-null object
puppo                    2175 non-null object
dtypes: float64(2), int64(3), object(9)
memory usage: 254.9+ KB


Success
* dropped the retweet columns (pandas Series) that are no longer needed
  * retweeted_status_id
  * retweeted_status_user_id
  * retweeted_status_timestamp
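The same test can be written as an assertion instead of a visual scan of the column list. A minimal sketch, assuming a hypothetical two-row frame `toy_df` that carries the three retweet columns:

```python
import pandas as pd

retweet_columns = [
    'retweeted_status_id',
    'retweeted_status_user_id',
    'retweeted_status_timestamp',
]

# Hypothetical stand-in for the archive after the retweet rows were removed.
toy_df = pd.DataFrame({
    'tweet_id': [1, 2],
    'retweeted_status_id': [None, None],
    'retweeted_status_user_id': [None, None],
    'retweeted_status_timestamp': [None, None],
})

toy_df = toy_df.drop(retweet_columns, axis=1, errors='ignore')

# None of the retweet columns should remain.
remaining = [c for c in retweet_columns if c in toy_df.columns]
assert remaining == []
print(list(toy_df.columns))
```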

## Assess

* visual assessment: the tweet_id column (pandas Series) is dtype('int64'), not dtype('O')
* tweet_id is an identifier, not a quantity, so it should be stored as a string

In [9]:
twitter_archive_enhanced_clean_df.tweet_id.dtypes

dtype('int64')

## Clean

### Define

Change the tweet_id dtype from dtype('int64') to dtype('O').

### Code

In [10]:
twitter_archive_enhanced_clean_df.tweet_id = \
twitter_archive_enhanced_clean_df.tweet_id.apply(str)

### Test

In [11]:
twitter_archive_enhanced_clean_df.tweet_id.dtypes

dtype('O')

Success  
tweet_id column (pandas Series) dtype changed from dtype('int64') to dtype('O')
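`Series.astype(str)` is an equivalent way to do the conversion that `apply(str)` performs here. The sketch below uses two illustrative 18-digit IDs (assumed values, chosen only to resemble tweet IDs) and confirms that both approaches produce the same strings:

```python
import pandas as pd

# Illustrative 18-digit IDs; int64 holds them exactly.
ids = pd.Series([892420643555336193, 892177421306343426], name='tweet_id')

# Vectorized conversion to strings.
as_text = ids.astype(str)

# Element-wise conversion, as used in the notebook.
applied = ids.apply(str)

print(as_text.iloc[0])
print(as_text.tolist() == applied.tolist())
```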

## Assess

* visual assessment -
* in_reply_to_status_id column (pandas Series) is dtype('float64'), not dtype('O')

In [12]:
twitter_archive_enhanced_clean_df.in_reply_to_status_id.dtypes

dtype('float64')

## Clean

### Define

Change the in_reply_to_status_id dtype from dtype('float64') to dtype('O').

### Code

In [13]:
# Format non-null values as integer strings; leave missing values as NaN
if twitter_archive_enhanced_clean_df.in_reply_to_status_id.dtypes == 'float64':
    twitter_archive_enhanced_clean_df.in_reply_to_status_id = \
    twitter_archive_enhanced_clean_df.in_reply_to_status_id.apply(
        lambda x: "{:.0f}".format(x) if not pd.isnull(x) else x)

### Test

In [14]:
twitter_archive_enhanced_clean_df.in_reply_to_status_id.dtypes

dtype('O')

Success  
in_reply_to_status_id column (pandas Series) dtype changed from dtype('float64') to dtype('O')
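One caveat is worth checking here: float64 represents integers exactly only up to 2**53, and 18-digit IDs exceed that, so an ID that pandas read as float64 may already have been rounded before it is formatted as a string. The sketch below shows the effect with an illustrative ID (an assumed value) and one possible way around it, reading the column as text through the `dtype` parameter of `pd.read_csv`. The in-memory CSV is hypothetical and stands in for the real file:

```python
import io
import pandas as pd

# Illustrative 18-digit ID; it is odd, so it cannot survive a float64 round trip.
original_id = 892420643555336193
round_tripped = int(float(original_id))
print(original_id == round_tripped)

# Hypothetical two-row CSV; the second row has no reply ID.
csv_text = "tweet_id,in_reply_to_status_id\n1,892420643555336193\n2,\n"

# Reading the column as str keeps every digit; missing values stay null.
df = pd.read_csv(io.StringIO(csv_text),
                 dtype={'in_reply_to_status_id': str})
print(df['in_reply_to_status_id'].iloc[0])
```

Whether this matters depends on whether the reply IDs are used later to join against other tweet data.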

## Assess

* visual assessment -
* in_reply_to_user_id column (pandas Series) is dtype('float64'), not dtype('O')

In [15]:
twitter_archive_enhanced_clean_df.in_reply_to_user_id.dtypes

dtype('float64')

## Clean

### Define

Change the in_reply_to_user_id dtype from dtype('float64') to dtype('O').

### Code

In [16]:
# Format non-null values as integer strings; leave missing values as NaN
if twitter_archive_enhanced_clean_df.in_reply_to_user_id.dtypes == 'float64':
    twitter_archive_enhanced_clean_df.in_reply_to_user_id = \
    twitter_archive_enhanced_clean_df.in_reply_to_user_id.apply(
        lambda x: "{:.0f}".format(x) if not pd.isnull(x) else x)

### Test

In [17]:
twitter_archive_enhanced_clean_df.in_reply_to_user_id.dtypes

dtype('O')

Success  
in_reply_to_user_id column (pandas Series) dtype changed from dtype('float64') to dtype('O')

## Assess

Assessment: timestamp is stored as object (strings), not datetime64[ns].

In [18]:
twitter_archive_enhanced_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                 2175 non-null object
in_reply_to_status_id    78 non-null object
in_reply_to_user_id      78 non-null object
timestamp                2175 non-null object
source                   2175 non-null object
text                     2175 non-null object
expanded_urls            2117 non-null object
rating_numerator         2175 non-null int64
rating_denominator       2175 non-null int64
name                     2175 non-null object
doggo                    2175 non-null object
floofer                  2175 non-null object
pupper                   2175 non-null object
puppo                    2175 non-null object
dtypes: int64(2), object(12)
memory usage: 254.9+ KB


## Clean

### Define

Change the timestamp dtype from object to datetime64[ns].

### Code

In [19]:
twitter_archive_enhanced_clean_df.timestamp = \
pd.to_datetime(twitter_archive_enhanced_clean_df.timestamp)

### Test

In [20]:
twitter_archive_enhanced_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                 2175 non-null object
in_reply_to_status_id    78 non-null object
in_reply_to_user_id      78 non-null object
timestamp                2175 non-null datetime64[ns]
source                   2175 non-null object
text                     2175 non-null object
expanded_urls            2117 non-null object
rating_numerator         2175 non-null int64
rating_denominator       2175 non-null int64
name                     2175 non-null object
doggo                    2175 non-null object
floofer                  2175 non-null object
pupper                   2175 non-null object
puppo                    2175 non-null object
dtypes: datetime64[ns](1), int64(2), object(11)
memory usage: 254.9+ KB


In [21]:
twitter_archive_enhanced_clean_df.head(1).timestamp

0   2017-08-01 16:23:56
Name: timestamp, dtype: datetime64[ns]

In [22]:
print(twitter_archive_enhanced_clean_df.timestamp.iloc[0].month)
print(twitter_archive_enhanced_clean_df.timestamp.iloc[0].minute)

8
23


Success  
* timestamp changed from object to datetime64[ns]
* correct month and minute extracted from datetime64[ns]
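Once a column is datetime64[ns], the `.dt` accessor extracts components for every row at once, which extends the single-row check above. A minimal sketch, assuming a hypothetical two-element Series `stamps`: the first value mirrors the timestamp shown above and the second is made up:

```python
import pandas as pd

# First value mirrors the notebook's first row; the second is illustrative.
stamps = pd.to_datetime(pd.Series(['2017-08-01 16:23:56',
                                   '2017-07-31 00:18:03']))

# Component extraction across the whole column.
months = stamps.dt.month.tolist()
minutes = stamps.dt.minute.tolist()
print(months, minutes)
```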

In [23]:
# Save cleaned pandas DataFrame
with open('twitter_archive_enhanced_clean_df.pkl', 'wb') as f:
    pickle.dump(twitter_archive_enhanced_clean_df, f)


In [24]:
# Make sure we can read it
with open('twitter_archive_enhanced_clean_df.pkl', 'rb') as f:
    twitter_archive_enhanced_clean_df = pickle.load(f)

print('twitter_archive_enhanced_clean_df.shape')
print(twitter_archive_enhanced_clean_df.shape)
print()

twitter_archive_enhanced_clean_df.shape
(2175, 14)
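The shape check confirms that the file loads, but it does not compare contents. A stricter round-trip check could use `DataFrame.equals`, as in the sketch below. The frame `toy_df` and the temporary file name are hypothetical, and pandas also offers `DataFrame.to_pickle` and `pd.read_pickle` as an alternative to the `pickle` module:

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned archive.
toy_df = pd.DataFrame({
    'tweet_id': ['1', '2'],
    'timestamp': pd.to_datetime(['2017-08-01 16:23:56',
                                 '2017-07-31 00:18:03']),
})

# Write to a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_df.pkl')
    with open(path, 'wb') as f:
        pickle.dump(toy_df, f)
    with open(path, 'rb') as f:
        restored_df = pickle.load(f)

# equals() compares values and dtypes, not just the shape.
same_content = toy_df.equals(restored_df)
print(same_content)
```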

