## The CSV

In data science we use the phrase "know your data." That’s because it’s important to learn as much about your data as you can
before willy-nilly reading it into memory. You probably don’t want to load all of
the columns into **pandas**. And you might want to tell **pandas** the type of data that’s in each column, rather than let it just guess.


Most data sets come with a "data dictionary," a file that describes the columns, their types, their meanings, and their ranges. It’s almost always
worth your while to read a data dictionary when starting to analyze the data.
In many cases, the dictionary will give you insights into the data.

For this first exercise, I want you to create a data frame from the CSV data for January 2020:

- Load the CSV file into a data frame, using only the columns passenger_count, trip_distance, payment_type, and total_amount.

*payment_type is a number describing how the passenger paid for the trip. The most important
values are 1 (credit card) and 2 (cash).*

- How many taxi rides had more than 8 passengers?

- How many taxi rides had zero passengers?

- How many taxi rides were paid for in cash, and cost more than $1,000?

- How many rides cost less than 0?

- How many rides traveled a below-average distance, but cost an above-average amount?

In [1]:
import pandas as pd

# Load only the specified columns from the dataset
nyc_taxi_df = pd.read_csv('data/nyc_taxi_2020-01.csv', usecols = ['passenger_count', 'trip_distance', 'payment_type', 'total_amount'])

# Preview the data
nyc_taxi_df.head()

Unnamed: 0,passenger_count,trip_distance,payment_type,total_amount
0,1.0,1.2,1.0,11.27
1,1.0,1.2,1.0,12.3
2,1.0,0.6,1.0,10.8
3,1.0,0.8,1.0,8.16
4,1.0,0.0,2.0,4.8


In [3]:
# Describe the dataset
nyc_taxi_df.describe()

Unnamed: 0,passenger_count,trip_distance,payment_type,total_amount
count,6339567.0,6405008.0,6339567.0,6405008.0
mean,1.515333,2.929644,1.270298,18.66315
std,1.151594,83.15911,0.4739985,14.75736
min,0.0,-30.62,1.0,-1242.3
25%,1.0,0.96,1.0,11.16
50%,1.0,1.6,1.0,14.3
75%,2.0,2.93,2.0,19.8
max,9.0,210240.1,5.0,4268.3


In [4]:
# Get information about column data types
nyc_taxi_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6405008 entries, 0 to 6405007
Data columns (total 4 columns):
 #   Column           Dtype  
---  ------           -----  
 0   passenger_count  float64
 1   trip_distance    float64
 2   payment_type     float64
 3   total_amount     float64
dtypes: float64(4)
memory usage: 195.5 MB


In [5]:
# How many taxi rides had more than 8 passengers?
nyc_taxi_df.loc[(nyc_taxi_df['passenger_count'] > 8), ['passenger_count']].count()

passenger_count    19
dtype: int64

In [6]:
# How many taxi rides had zero passengers?
nyc_taxi_df.loc[(nyc_taxi_df['passenger_count'] == 0), ['passenger_count']].count()

passenger_count    114302
dtype: int64

In [7]:
# How many taxi rides were paid for in cash, and cost more than $1,000?
nyc_taxi_df.loc[(nyc_taxi_df['payment_type'] == 2) & (nyc_taxi_df['total_amount'] > 1000)]

Unnamed: 0,passenger_count,trip_distance,payment_type,total_amount
471401,1.0,8.27,2.0,1242.3
4049543,1.0,1.57,2.0,4268.3
5059294,1.0,58.85,2.0,1722.3


In [8]:
# How many rides cost less than 0?
nyc_taxi_df.loc[(nyc_taxi_df['total_amount'] < 0)].count()

passenger_count    19441
trip_distance      19505
payment_type       19441
total_amount       19505
dtype: int64

In [9]:
# How many rides traveled a below-average distance, but cost an above-average amount?
nyc_taxi_df.loc[(nyc_taxi_df['trip_distance'] < nyc_taxi_df['trip_distance'].mean()) & (nyc_taxi_df['total_amount'] > nyc_taxi_df['total_amount'].mean())].count()

passenger_count    375673
trip_distance      387832
payment_type       375673
total_amount       387832
dtype: int64
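
The per-column counts in the last two results differ (for example, 19441 versus 19505) because `count()` reports non-null values per column, and passenger_count and payment_type contain NaN. A minimal sketch of counting matching rows directly, by summing the boolean mask; the four-row `sample` frame is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing passenger_count.
sample = pd.DataFrame({
    'passenger_count': [1.0, np.nan, 2.0, 1.0],
    'total_amount': [-5.0, -3.5, 12.0, -1.0],
})

# Boolean mask: True for every ride that cost less than 0.
negative = sample['total_amount'] < 0

# count() gives non-null values per column, so columns with NaN report fewer.
per_column = sample.loc[negative].count()

# Summing the mask counts True values, i.e. the number of matching rows.
n_rows = int(negative.sum())

print(per_column)
print(n_rows)
```

Summing the mask gives a single number per question, which is usually what a "how many rides" question is asking for.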

The first thing we need to do to solve this problem is create a new data frame from the CSV file.
Fortunately, the data is formatted in such a way that pd.read_csv will work just fine with its
defaults, returning a data frame with named columns. But this file contains a lot of
data (6,405,008 rides, to be exact), and if we only keep the columns we need, we’ll reduce the
memory footprint by quite a lot. (Indeed, I found that loading only the columns we asked for
reduced the memory usage from 580 MB to 200 MB.)

The usecols parameter to pd.read_csv allows us to select which columns from the CSV file
will be kept around. The parameter takes a list as an argument, and that list can either contain
integers (indicating the numeric index of each column) or strings representing the column names.
I generally prefer to use strings, since they’re more readable, and that’s what I did here.
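
To make the two forms of `usecols` concrete, here is a small sketch that reads the same columns once by name and once by position. The `csv_text` sample (including its `VendorID` column) is a made-up stand-in for the real file, so it can run without downloading anything:

```python
import io

import pandas as pd

# Hypothetical four-row sample standing in for the real taxi CSV.
csv_text = """VendorID,passenger_count,trip_distance,payment_type,total_amount
1,1,1.2,1,11.27
1,1,0.6,1,10.8
2,2,3.4,2,18.3
2,1,0.0,2,4.8
"""

# Select columns by name.
by_name = pd.read_csv(io.StringIO(csv_text),
                      usecols=['passenger_count', 'total_amount'])

# Select the same columns by 0-based position in the file.
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[1, 4])

# Load everything, for a memory comparison.
full = pd.read_csv(io.StringIO(csv_text))

print(by_name.columns.tolist())
print(by_name.equals(by_position))
print(full.memory_usage(deep=True).sum(), by_name.memory_usage(deep=True).sum())
```

On a file this small the saving is negligible, but the same comparison on the full taxi file shows why trimming columns at load time is worthwhile.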

In [10]:
# How many of the rides that cost less than 0 were indeed for either a dispute
# (payment_type of 4) or a voided trip (payment_type of 5)?

nyc_taxi_df.loc[(nyc_taxi_df['total_amount'] < 0) & ((nyc_taxi_df['payment_type'] == 4) | (nyc_taxi_df['payment_type'] == 5))]

Unnamed: 0,passenger_count,trip_distance,payment_type,total_amount
1007,2.0,1.26,4.0,-12.3
1449,1.0,0.03,4.0,-6.3
2148,2.0,0.78,4.0,-10.3
3113,2.0,2.41,4.0,-16.8
4104,1.0,0.02,4.0,-7.8
...,...,...,...,...
6336915,2.0,0.00,4.0,-55.3
6336939,1.0,0.45,4.0,-11.3
6338071,1.0,7.28,4.0,-29.3
6338169,1.0,0.96,4.0,-8.8


In [11]:
# Credit Card vs Cash
nyc_taxi_df.groupby('payment_type')['payment_type'].count()

payment_type
1.0    4694897
2.0    1593834
3.0      32770
4.0      18065
5.0          1
Name: payment_type, dtype: int64

In [12]:
# Credit Card vs Cash
sum_passengers = nyc_taxi_df.groupby('payment_type')['payment_type'].count().sum()

round((nyc_taxi_df.groupby('payment_type')['payment_type'].count() / sum_passengers) * 100, 2)

payment_type
1.0    74.06
2.0    25.14
3.0     0.52
4.0     0.28
5.0     0.00
Name: payment_type, dtype: float64
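
The same percentage breakdown can also be sketched with `value_counts(normalize=True)`, which divides each count by the total of non-null values in one step. The `payments` series below is invented for illustration:

```python
import pandas as pd

# Hypothetical payment types; None becomes NaN and is skipped by value_counts.
payments = pd.Series([1.0, 1.0, 1.0, 2.0, None], name='payment_type')

# normalize=True returns fractions of the non-null total; scale to percent.
shares = payments.value_counts(normalize=True).mul(100).round(2)

print(shares)
```

As with the groupby version, rides with a missing payment_type are left out of the denominator.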

### Pandemic Taxis

In this exercise, create a data frame from two different CSV files containing New
York taxi data: one from January 2020 (before the pandemic), and a second from July 2020 (near
the height of the pandemic, at least in New York). The data frame should contain three columns
from the files: passenger_count, total_amount, and payment_type. It should also include a
fourth column, month, which should be set to either January or July, depending on the file from which
the data was loaded.

In [13]:
# Load specified Columns from the Jan 2020 Dataset
nyc_jan_2020_taxi_df = pd.read_csv('data/nyc_taxi_2020-01.csv', usecols = ['tpep_pickup_datetime','tpep_dropoff_datetime' , 'passenger_count', 'total_amount', 'payment_type'])

# Load specified Columns from the Jul 2020 Dataset
nyc_jul_2020_taxi_df = pd.read_csv('data/nyc_taxi_2020-07.csv', usecols = ['tpep_pickup_datetime','tpep_dropoff_datetime' , 'passenger_count', 'total_amount', 'payment_type'])


In [14]:
from datetime import datetime as dt

# Convert Date Columns to DateTime
nyc_jan_2020_taxi_df['tpep_pickup_datetime'] = pd.to_datetime(nyc_jan_2020_taxi_df['tpep_pickup_datetime'], format='%Y-%m-%d %H:%M:%S')

nyc_jan_2020_taxi_df['tpep_dropoff_datetime'] = pd.to_datetime(nyc_jan_2020_taxi_df['tpep_dropoff_datetime'], format='%Y-%m-%d %H:%M:%S')

# Create a new column with the month
nyc_jan_2020_taxi_df['month'] = nyc_jan_2020_taxi_df['tpep_pickup_datetime'].dt.strftime(date_format='%B')

# Preview the Jan Data
nyc_jan_2020_taxi_df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,payment_type,total_amount,month
0,2020-01-01 00:28:15,2020-01-01 00:33:03,1.0,1.0,11.27,January
1,2020-01-01 00:35:39,2020-01-01 00:43:04,1.0,1.0,12.3,January
2,2020-01-01 00:47:41,2020-01-01 00:53:52,1.0,1.0,10.8,January
3,2020-01-01 00:55:23,2020-01-01 01:00:14,1.0,1.0,8.16,January
4,2020-01-01 00:01:58,2020-01-01 00:04:16,1.0,2.0,4.8,January


In [15]:
# Convert Date Columns to DateTime
nyc_jul_2020_taxi_df['tpep_pickup_datetime'] = pd.to_datetime(nyc_jul_2020_taxi_df['tpep_pickup_datetime'], format='%Y-%m-%d %H:%M:%S')

nyc_jul_2020_taxi_df['tpep_dropoff_datetime'] = pd.to_datetime(nyc_jul_2020_taxi_df['tpep_dropoff_datetime'], format='%Y-%m-%d %H:%M:%S')

# Create a new column with the month
nyc_jul_2020_taxi_df['month'] = nyc_jul_2020_taxi_df['tpep_pickup_datetime'].dt.strftime(date_format='%B')

# Preview the Jul Data
nyc_jul_2020_taxi_df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,payment_type,total_amount,month
0,2020-07-01 00:25:32,2020-07-01 00:33:39,1.0,2.0,9.3,July
1,2020-07-01 00:03:19,2020-07-01 00:25:43,1.0,1.0,27.8,July
2,2020-07-01 00:15:11,2020-07-01 00:29:24,1.0,2.0,22.3,July
3,2020-07-01 00:30:49,2020-07-01 00:38:26,1.0,1.0,14.16,July
4,2020-07-01 00:31:26,2020-07-01 00:38:02,1.0,2.0,7.8,July


With that data in hand, I want you to answer a few questions:
- How many rides were taken in Jan vs. Jul?
- How much money (in total) was collected by taxis in Jan vs. Jul?
- Did the proportion of trips with more than one passenger change dramatically?
- Did people use cash less in Jul than in Jan?

In [16]:
# Rides in January vs July
# Note: count() skips NaN, so rides with a missing passenger_count are not included

num_of_jan_rides = nyc_jan_2020_taxi_df[['passenger_count']].count()

num_of_jul_rides = nyc_jul_2020_taxi_df[['passenger_count']].count()

# Retrieve Int object from Numpy
print(f'Number of January rides --> {num_of_jan_rides.item()}\nNumber of July rides --> {num_of_jul_rides.item()}')

Number of January rides --> 6339567
Number of July rides --> 737565


In [17]:
# How much money (in total) was collected by taxis in Jan vs. Jul?
tot_amt_coll_Jan = nyc_jan_2020_taxi_df[['total_amount']].sum()

tot_amt_coll_Jul = nyc_jul_2020_taxi_df[['total_amount']].sum()

# Retrieve a Python float from the one-element Series
print(f'Total amount from Rides in January --> {tot_amt_coll_Jan.item():,.2f}\nTotal amount from Rides in July --> {tot_amt_coll_Jul.item():,.2f}')

Total amount from Rides in January --> 119,537,617.35
Total amount from Rides in July --> 14,912,844.09


In [18]:
# Did the proportion of trips with more than one passenger change dramatically?

more_than_one_pass_trips_jan = nyc_jan_2020_taxi_df.loc[nyc_jan_2020_taxi_df['passenger_count'] > 1, ['passenger_count']].count()


more_than_one_pass_trips_jul = nyc_jul_2020_taxi_df.loc[nyc_jul_2020_taxi_df['passenger_count'] > 1, ['passenger_count']].count()

# Retrieve Int object from Numpy
print(f'Total number of Rides in January with more than 1 passenger --> {more_than_one_pass_trips_jan.item():,.2f}\nTotal number of Rides in July with more than 1 passenger --> {more_than_one_pass_trips_jul.item():,.2f}')

Total number of Rides in January with more than 1 passenger --> 1,678,039.00
Total number of Rides in July with more than 1 passenger --> 152,050.00


In [19]:
# Did people use cash less in Jul than in Jan?
jan_cash_payment = nyc_jan_2020_taxi_df.loc[nyc_jan_2020_taxi_df['payment_type'] == 2, ['payment_type']].count()

jul_cash_payment = nyc_jul_2020_taxi_df.loc[nyc_jul_2020_taxi_df['payment_type'] == 2, ['payment_type']].count()
 
# Retrieve Int object from Numpy
print(f'Total number of Rides in January with cash payment --> {jan_cash_payment.item():,.2f}\nTotal number of Rides in July with cash payment --> {jul_cash_payment.item():,.2f}')

Total number of Rides in January with cash payment --> 1,593,834.00
Total number of Rides in July with cash payment --> 236,433.00


In [20]:
# Correlation Matrix for Jan 2020
nyc_jan_2020_taxi_df.corr(numeric_only=True)

Unnamed: 0,passenger_count,payment_type,total_amount
passenger_count,1.0,0.01028,0.006659
payment_type,0.01028,1.0,-0.141464
total_amount,0.006659,-0.141464,1.0


In [21]:
# Correlation Matrix for Jul 2020
nyc_jul_2020_taxi_df.corr(numeric_only=True)

Unnamed: 0,passenger_count,payment_type,total_amount
passenger_count,1.0,-0.000618,0.004711
payment_type,-0.000618,1.0,-0.166798
total_amount,0.004711,-0.166798,1.0


In [22]:
# Concatenate the two data frames into a single taxis data frame
taxis_df = pd.concat([nyc_jan_2020_taxi_df, nyc_jul_2020_taxi_df])

taxis_df.sample(10)

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,payment_type,total_amount,month
1782140,2020-01-10 15:04:23,2020-01-10 15:15:07,3.0,1.0,14.76,January
326513,2020-07-16 07:12:36,2020-07-16 07:18:56,1.0,2.0,9.3,July
1607198,2020-01-09 19:44:19,2020-01-09 19:59:29,1.0,1.0,16.3,January
2512384,2020-01-14 06:45:31,2020-01-14 06:58:53,1.0,1.0,17.38,January
5272310,2020-01-27 09:45:12,2020-01-27 10:00:32,1.0,2.0,13.8,January
2567563,2020-01-14 11:47:07,2020-01-14 11:49:54,2.0,1.0,8.55,January
4887242,2020-01-25 10:23:12,2020-01-25 10:27:35,1.0,1.0,11.44,January
586492,2020-01-04 14:14:15,2020-01-04 14:20:35,2.0,1.0,12.3,January
3756326,2020-01-19 21:01:59,2020-01-19 21:04:54,1.0,2.0,7.8,January
2802526,2020-01-15 13:58:28,2020-01-15 14:20:07,2.0,1.0,24.35,January
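
Once both months live in one data frame with a month column, each of the four questions can be answered with a single groupby. The sketch below uses two tiny made-up frames (`jan`, `jul`) rather than the real files, and passes `ignore_index=True` to `pd.concat` so the combined frame gets a fresh index; both choices are assumptions for illustration, not part of the original solution:

```python
import pandas as pd

# Hypothetical stand-ins for the two monthly files.
jan = pd.DataFrame({
    'passenger_count': [1, 2, 1, 3],
    'payment_type': [1, 2, 1, 2],
    'total_amount': [10.0, 20.0, 5.0, 15.0],
}).assign(month='January')

jul = pd.DataFrame({
    'passenger_count': [1, 1],
    'payment_type': [1, 2],
    'total_amount': [8.0, 12.0],
}).assign(month='July')

# ignore_index=True renumbers rows so index labels are not repeated.
taxis = pd.concat([jan, jul], ignore_index=True)

by_month = taxis.groupby('month')

rides = by_month.size()                    # rows per month, NaN or not
revenue = by_month['total_amount'].sum()   # total collected per month

# The mean of a boolean Series is the proportion of True values.
multi_share = (taxis['passenger_count'] > 1).groupby(taxis['month']).mean()
cash_share = (taxis['payment_type'] == 2).groupby(taxis['month']).mean()

print(rides, revenue, multi_share, cash_share, sep='\n\n')
```

Comparing proportions rather than raw counts matters here, because the two months have very different numbers of rides.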


### Setting Column Data Types

In [30]:
import numpy as np

df = pd.read_csv('data/nyc_taxi_2020-01.csv', 
                 usecols = ['passenger_count', 'payment_type', 'total_amount'],
                 # float16 keeps only about 3 significant digits, so total_amount uses float32
                 dtype = {'passenger_count': np.float16, 'payment_type': np.float16, 'total_amount': np.float32}
                 )

# df.count to determine which columns might contain NaN
df.count()

passenger_count    6339567
payment_type       6339567
total_amount       6405008
dtype: int64

In [29]:
# Now drop all na and assign to the dataframe

df = df.dropna().copy()

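# Assign the converted columns back by name; in recent pandas versions,
# df.loc[:, col] = ... can write into the existing float column and keep its dtype
df['passenger_count'] = df['passenger_count'].astype(np.int8)

df['payment_type'] = df['payment_type'].astype(np.int8)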
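
The reason for the two-step approach is that NaN cannot be stored in a NumPy integer column, so columns with missing values are read as floats first and converted only after the NaN rows are gone. A compact sketch of that flow, using a made-up three-row CSV and `astype` with a dictionary to convert both columns at once:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample; the second row is missing two values.
csv_text = """passenger_count,payment_type,total_amount
1,1,11.27
,,4.8
2,2,18.3
"""

# Read as floats so the missing values can be represented as NaN.
small = pd.read_csv(io.StringIO(csv_text), dtype=np.float32)

# Drop incomplete rows, then convert the count-like columns to small integers.
small = small.dropna().astype({'passenger_count': np.int8,
                               'payment_type': np.int8})

print(small.dtypes)
print(len(small))
```

Note that `dropna()` discards the whole row, including its total_amount, so this trade-off is only reasonable when the incomplete rows are not needed.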




### Working with URLs

Retrieve the dates and values for Bitcoin over the most recent year,
as of when you read this. (For that reason, your results will likely look a bit different from mine,
even if you use the same code to calculate them.) Once you have retrieved this data, I want you
to produce a report showing:
- The closing price for the most recent trading day
- The lowest historical price, and the date of that price
- The highest historical price, and the date of that price

**https://api.blockchain.info/charts/market-price?format=csv**


In [2]:
snp_stock_url = "https://api.blockchain.info/charts/market-price?format=csv"

snp_stock_df = pd.read_csv(snp_stock_url, header=None, names=['date', 'price'])

snp_stock_df.head()

Unnamed: 0,date,price
0,2021-02-12 00:00:00,48013.38
1,2021-02-13 00:00:00,47471.4
2,2021-02-14 00:00:00,47185.19
3,2021-02-15 00:00:00,48720.37
4,2021-02-16 00:00:00,47951.85


In [3]:
# Let's describe the dataset
snp_stock_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    366 non-null    object 
 1   price   366 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.8+ KB


In [6]:
# The closing price for the most recent trading day
snp_stock_df.tail(1)['price']

365    42401.27
Name: price, dtype: float64

In [8]:
# The lowest historical price, and the date of that price
snp_stock_df.loc[(snp_stock_df['price'] == snp_stock_df['price'].min()), ['date']]

Unnamed: 0,date
159,2021-07-21 00:00:00


In [9]:
# The highest historical price, and the date of that price
snp_stock_df.loc[(snp_stock_df['price'] == snp_stock_df['price'].max()), ['date']]


Unnamed: 0,date
270,2021-11-09 00:00:00
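
The three report items can also be pulled out with `idxmin` and `idxmax`, which return the index label of the lowest and highest value, so the date and the price come back together. This is a sketch on invented prices and dates (not real Bitcoin data), laid out like the headerless two-column CSV above:

```python
import io

import pandas as pd

# Hypothetical prices in the same headerless date,price layout.
csv_text = """2021-01-01 00:00:00,100.0
2021-02-01 00:00:00,50.0
2021-03-01 00:00:00,200.0
2021-04-01 00:00:00,150.0
"""

prices = pd.read_csv(io.StringIO(csv_text), header=None,
                     names=['date', 'price'], parse_dates=['date'])

latest = prices.iloc[-1]                          # last row in the file
lowest = prices.loc[prices['price'].idxmin()]     # row with the minimum price
highest = prices.loc[prices['price'].idxmax()]    # row with the maximum price

print(f"Latest close: {latest['price']:.2f} on {latest['date']:%Y-%m-%d}")
print(f"Lowest price: {lowest['price']:.2f} on {lowest['date']:%Y-%m-%d}")
print(f"Highest price: {highest['price']:.2f} on {highest['date']:%Y-%m-%d}")
```

If the minimum or maximum occurs on several dates, `idxmin` and `idxmax` return only the first one, whereas the boolean-mask approach returns every matching row.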
