# STAGE 3: PROCESS - Getting to Know the Data
## Getting to Know the Data and Define Tasks

- What tools are you choosing and why?
- Have you ensured your data’s integrity?
- What steps have you taken to ensure that your data is clean?
- How can you verify that your data is clean and ready to analyze?
- Have you documented your cleaning process so you can review and share those results?


## Key tasks
Now that you know your data is credible and relevant to your problem, you’ll need to clean it so that your analysis will be error-free.
- Check the data for errors
- Transform the data into the right type
- Document the cleaning process
- Choose your tools


## Process Data
### What tools are you choosing and why?
#### Python
Since I am familiar with Python scripting, I decided to use Python to carry out my programming tasks. I chose the less challenging tool because I want to focus on documenting the data analysis process. This Jupyter notebook will be my reference document for the future. I hope it will be helpful for people who did not go through the whole certification process. Or, because of it, some of you may choose to take the course and enjoy the benefit of what I think is very comprehensive and helpful course material.

### Data Exploration

The CSV files are located in the subfolder "./Data/csv".

The first task is to get to know my data, starting with the datatype of each column.

After looking at the first few CSV files, I interpret the columns as listed below.
These interpretations still need to be verified against the data in detail, and some cleanup and data alignment will likely be necessary.

   * ride_id                       object: each trip is anonymized with a unique id
   * rideable_type               category: Divvy bike types
   * started_at            datetime64[ns]: Trip start datetime
   * ended_at              datetime64[ns]: Trip end datetime
   * start_station_name          category: Trip start station name
   * start_station_id            category: Trip start station id
   * end_station_name            category: Trip end station name
   * end_station_id              category: Trip end station id
   * start_lat                    float64: latitude of start station
   * start_lng                    float64: longitude of start station
   * end_lat                      float64: latitude of end station
   * end_lng                      float64: longitude of end station
   * member_casual               category: rider type (casual or member)
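
A few quick checks can help confirm these interpretations once a file is loaded. The following is a minimal sketch on a small made-up dataframe. The rows, the bike type values, and the `df_sample` name are illustrative assumptions, not taken from the Divvy files. The same calls could be applied to a dataframe loaded from one of the CSV files.

```python
import pandas as pd

# Hypothetical sample rows that mimic a few of the columns listed above
df_sample = pd.DataFrame({
    "ride_id": ["A1", "B2", "C3"],
    "rideable_type": ["docked_bike", "docked_bike", "electric_bike"],
    "started_at": pd.to_datetime(
        ["2020-04-01 08:00", "2020-04-01 09:30", "2020-04-02 17:15"]
    ),
    "ended_at": pd.to_datetime(
        ["2020-04-01 08:20", "2020-04-01 09:45", "2020-04-02 17:05"]
    ),
    "member_casual": ["member", "casual", "member"],
})

# Is every ride_id unique, as the interpretation assumes?
ride_id_is_unique = df_sample["ride_id"].is_unique

# Which values actually occur in the rider type column?
rider_types = sorted(df_sample["member_casual"].unique())

# How many missing values does each column have?
missing_per_column = df_sample.isna().sum()

# How many trips end before they start (a sign that cleanup is needed)?
n_negative_duration = int(
    (df_sample["ended_at"] < df_sample["started_at"]).sum()
)

print(ride_id_is_unique, rider_types, n_negative_duration)
print(missing_per_column)
```

Checks like these do not clean anything by themselves; they only show where an interpretation does not hold, so that each cleanup step can be decided and documented.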

Here I define the datatypes for the data columns to make sure they are loaded into dataframes in the right format for later processing.
Keep in mind that I ran read_csv before this to get the list of header (column) names.
From this [article](https://drawingfromdata.com/pandas/concat/memory/exploding-memory-usage-with-concat-and-categories.html), I learned that it is a good idea to use the **categorical** data type when loading known categorical columns.
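
One way to get the header names without loading any data rows is to pass `nrows=0` to `read_csv`. The sketch below is an illustration only: it uses a tiny in-memory CSV (`csv_text`, a made-up stand-in for a real file) and a hypothetical `known_categoricals` set, then builds the dtype mapping from the header.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for one of the CSV files
csv_text = (
    "ride_id,rideable_type,started_at,ended_at,member_casual\n"
    "A1,docked_bike,2020-04-01 08:00:00,2020-04-01 08:20:00,member\n"
    "B2,docked_bike,2020-04-01 09:30:00,2020-04-01 09:45:00,casual\n"
)

# nrows=0 parses only the header line, so no data rows are loaded
header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns.tolist()

# Build the dtype mapping from the header names known to be categorical
known_categoricals = {"rideable_type", "member_casual"}
dtypes = {col: "category" for col in header if col in known_categoricals}

# Load the data with the chosen dtypes and parse the two datetime columns
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype=dtypes,
    parse_dates=["started_at", "ended_at"],
)
print(header)
print(df.dtypes)
```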

### Checking Memory Optimization on One CSV File

In [1]:
# str vs. category for the known categorical columns
cat_cols = ['rideable_type', 'start_station_name', 'start_station_id', 'end_station_name', 'end_station_id', 'member_casual']
dtypes0 = {'ride_id': 'str', **{col: 'str' for col in cat_cols}}
dtypes1 = {'ride_id': 'str', **{col: 'category' for col in cat_cols}}

In [2]:
#load the csv file names
import pandas as pd
file_list_df = pd.read_csv('file_list_2020.csv', header=None, names= ['filename'])
file_list = file_list_df['filename'].values

In [3]:
# read just the first csv file
df0= pd.read_csv('./Data/csv/'+file_list[0], parse_dates=['started_at','ended_at'], dtype = dtypes0)
df1= pd.read_csv('./Data/csv/'+file_list[0], parse_dates=['started_at','ended_at'], dtype = dtypes1)

In [4]:
# Compute memory usage; deep=True also counts the string contents
m0 = df0.memory_usage(deep=True).sum()/1e+6 # in megabytes
m1 = df1.memory_usage(deep=True).sum()/1e+6 # in megabytes
print ('memory usage for dataframe with "Str" as datatype', m0 ,' Megabyte')
print ('memory usage for dataframe with "category" as datatype', m1, ' Megabyte')

memory usage for dataframe with "Str" as datatype 45.228412  Megabyte
memory usage for dataframe with "category" as datatype 11.341209  Megabyte
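
The saving comes from how a categorical column is stored: each distinct value is kept once, and every row holds only a small integer code. The sketch below illustrates this on a made-up column (the station names and the row count are assumptions for illustration, not Divvy data).

```python
import pandas as pd

# Hypothetical column: 3 distinct station names, 30,000 rows in total
names = ["Station A", "Station B", "Station C"] * 10_000
as_str = pd.Series(names, dtype="object")
as_cat = as_str.astype("category")

# deep=True counts the Python string objects, not just the pointers
bytes_str = as_str.memory_usage(deep=True)
bytes_cat = as_cat.memory_usage(deep=True)

# A categorical keeps the distinct values once plus one integer code per row
print(as_cat.cat.categories.tolist())
print(as_cat.cat.codes.dtype)
print(bytes_str, bytes_cat)
```

The benefit shrinks as the number of distinct values approaches the number of rows, which is one reason to keep a near-unique column such as `ride_id` as a string.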


In [5]:
# confirm the data types
df1.dtypes

ride_id                       object
rideable_type               category
started_at            datetime64[ns]
ended_at              datetime64[ns]
start_station_name          category
start_station_id            category
end_station_name            category
end_station_id              category
start_lat                    float64
start_lng                    float64
end_lat                      float64
end_lng                      float64
member_casual               category
dtype: object

In [None]:
# For this file, declaring the known categorical columns as category reduces memory usage from about 45.2 MB to about 11.3 MB, roughly a 75% reduction.
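
One caveat applies if the per-file dataframes are later combined with `pd.concat`: when the same categorical column has different category sets in two frames, the combined column is no longer categorical and the memory saving is lost. The sketch below shows one possible way to avoid this, by giving both frames a shared `CategoricalDtype` first. The frames and bike type values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-file frames whose category sets differ
df_a = pd.DataFrame(
    {"rideable_type": pd.Categorical(["docked_bike", "electric_bike"])}
)
df_b = pd.DataFrame(
    {"rideable_type": pd.Categorical(["docked_bike", "classic_bike"])}
)

# Differing categories: the concatenated column is no longer categorical
naive = pd.concat([df_a, df_b], ignore_index=True)
naive_is_categorical = isinstance(
    naive["rideable_type"].dtype, pd.CategoricalDtype
)

# Give both frames one shared categorical dtype before concatenating
all_values = sorted(
    set(df_a["rideable_type"].cat.categories)
    | set(df_b["rideable_type"].cat.categories)
)
shared = pd.CategoricalDtype(categories=all_values)
aligned = pd.concat(
    [
        df_a.astype({"rideable_type": shared}),
        df_b.astype({"rideable_type": shared}),
    ],
    ignore_index=True,
)
aligned_is_categorical = isinstance(
    aligned["rideable_type"].dtype, pd.CategoricalDtype
)

print(naive_is_categorical, aligned_is_categorical)
```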