# CS3PP19 - Programming in Python for Data Science - Practical 2

## Pandas & NumPy

Follow the instructions to complete each of these tasks. This set of exercises focuses on working with Python's Pandas and NumPy libraries.

**Questions marked with a * are extra challenging**

The relevant materials for these exercises are Lectures 4, 5 and 6 (NumPy and Pandas).

This is not assessed but will help you gain practical experience for the exam and coursework.

You will need to download some of the CSV data files from the module Blackboard page and place them in a `data` folder alongside this notebook, which is where the example solutions read them from (for example `data/diamonds.csv`). Run the cell below to load all of the necessary Python modules.

## PANDAS

In [2]:
import pandas as pd
import requests
import numpy as np
from pandas import json_normalize  # top-level location since pandas 1.0

## 1. Diamonds example data

1.1. Read in the diamonds csv file to a pandas data frame. Use pandas to find how many diamonds have carat greater than 3.5.

In [3]:
diamonds = pd.read_csv("data/diamonds.csv")
diamonds.head()

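| Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |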
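
The cell above loads the file but stops short of the count. As a rough sketch on a made-up table (the values are invented; only the `carat` and `price` column names come from the output above), comparing a column with a number gives a boolean Series, and summing it counts the matching rows:

```python
import pandas as pd

# Toy stand-in for the diamonds data (values are made up for illustration).
toy = pd.DataFrame({"carat": [0.23, 3.65, 4.01, 1.10],
                    "price": [326, 11668, 15223, 4500]})

# Comparing a column with a number gives a boolean Series (the "mask").
mask = toy["carat"] > 3.5

# True counts as 1 and False as 0, so summing the mask counts the matches.
n_large = int(mask.sum())
print(n_large)
```

Filtering first and taking `len(toy[mask])` should give the same number.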


<hr style="border:2px solid black"> </hr>

1.2. Create a series of the price of all of the diamonds that have carat greater than 3.5.

In [5]:
greatCarat = diamonds[diamonds['carat']>3.5]
greatCarat.head()

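| Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 23644 | 3.65 | Fair | H | I1 | 67.1 | 53.0 | 11668 | 9.53 | 9.48 | 6.38 |
| 25998 | 4.01 | Premium | I | I1 | 61.0 | 61.0 | 15223 | 10.14 | 10.1 | 6.17 |
| 25999 | 4.01 | Premium | J | I1 | 62.5 | 62.0 | 15223 | 10.02 | 9.94 | 6.24 |
| 26444 | 4.0 | Very Good | I | I1 | 63.3 | 58.0 | 15984 | 10.01 | 9.94 | 6.31 |
| 26534 | 3.67 | Premium | I | I1 | 62.4 | 56.0 | 16193 | 9.86 | 9.81 | 6.13 |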
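
Note that `greatCarat` above is still a DataFrame holding every column. One possible way to get a Series of prices instead, shown here on invented values, is to pass `.loc` both the row condition and a single column label:

```python
import pandas as pd

# Toy stand-in for the diamonds data (values are made up for illustration).
toy = pd.DataFrame({"carat": [0.23, 3.65, 4.01, 1.10],
                    "price": [326, 11668, 15223, 4500]})

# .loc takes a row mask and a column label; a single label returns a Series.
large_prices = toy.loc[toy["carat"] > 3.5, "price"]

print(type(large_prices).__name__)
print(large_prices.tolist())
```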


<hr style="border:2px solid black"> </hr>

1.3. For ideal cut diamonds whose price is greater than 1000, find the number of diamonds having each clarity.

In [19]:
greatAndPricey = diamonds[(diamonds['price']>1000) & (diamonds['cut']=='Ideal')]
greatAndPricey.groupby(['clarity'])['carat'].count()

clarity
I1       135
IF       643
SI1     3129
SI2     2222
VS1     2425
VS2     3227
VVS1    1229
VVS2    1690
Name: carat, dtype: int64
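
`count()` on a selected column counts its non-missing values, so the choice of `'carat'` above only matters if that column has gaps. A small sketch with made-up data, comparing it with `size()`, which counts rows:

```python
import numpy as np
import pandas as pd

# Made-up rows; one carat value is deliberately missing.
toy = pd.DataFrame({"clarity": ["SI1", "SI1", "VS2", "VS2", "VS2"],
                    "carat": [0.3, np.nan, 0.5, 0.7, 0.9]})

by_count = toy.groupby("clarity")["carat"].count()  # non-missing carat values per group
by_size = toy.groupby("clarity").size()             # rows per group, missing or not

print(by_count)
print(by_size)
```

Here the two results differ only for `SI1`, the group that contains the missing value.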

<hr style="border:2px solid black"> </hr>

## 2. Vancouver street trees data

2.1. Load the Vancouver street trees data provided on Blackboard. What is the most common genus of tree?
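
As a hedged illustration of finding the most frequent value in a column, the sketch below uses a tiny invented table. The column name `genus` is an assumption, so check the real column names with `.columns` after loading the file.

```python
import pandas as pd

# Invented data; the column name "genus" is hypothetical.
trees = pd.DataFrame({"genus": ["ACER", "PRUNUS", "ACER", "TILIA", "ACER"]})

# value_counts() tallies each distinct value, most frequent first.
counts = trees["genus"].value_counts()

# idxmax() returns the label that has the largest count.
most_common = counts.idxmax()
print(most_common, counts[most_common])
```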

<hr style="border:2px solid black"> </hr>

2.2. Find the mean diameter of trees with height range ID 9.

<hr style="border:2px solid black"> </hr>

2.3. Produce a pandas data frame giving the maximum and minimum height range id on each street.
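
One way to get several summary statistics per group in a single data frame is `groupby` followed by `agg` with a list of function names. The sketch below uses invented rows and assumed column names (`street`, `height_range_id`), which may not match the real file.

```python
import pandas as pd

# Invented data; both column names are hypothetical.
trees = pd.DataFrame({
    "street": ["MAIN ST", "MAIN ST", "OAK ST", "OAK ST", "OAK ST"],
    "height_range_id": [2, 5, 1, 9, 4],
})

# agg with a list of names returns a DataFrame with one column per statistic.
height_span = trees.groupby("street")["height_range_id"].agg(["min", "max"])
print(height_span)
```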


<hr style="border:2px solid black"> </hr>

## 3. Iris flower example data

3.1. Load the iris.csv flower data. Add two extra columns to the data frame giving the ratio of sepal length over width and petal length over width.
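
Dividing one column by another works element by element, so a new column can be assigned directly from the result. This is a sketch on two invented rows; the column names are assumptions and may be spelled differently in iris.csv.

```python
import pandas as pd

# Invented measurements; column names are hypothetical.
iris_toy = pd.DataFrame({
    "sepal_length": [5.0, 6.0], "sepal_width": [2.5, 3.0],
    "petal_length": [1.5, 4.5], "petal_width": [0.5, 1.5],
    "species": ["setosa", "versicolor"],
})

# Column arithmetic is applied row by row.
iris_toy["sepal_ratio"] = iris_toy["sepal_length"] / iris_toy["sepal_width"]
iris_toy["petal_ratio"] = iris_toy["petal_length"] / iris_toy["petal_width"]
print(iris_toy[["sepal_ratio", "petal_ratio"]])
```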



<hr style="border:2px solid black"> </hr>

3.2. Calculate the mean of the ratio between sepal length and width for each species.

<hr style="border:2px solid black"> </hr>


3.3. Perform a data discovery on the dataset.
- How many classes are there?
- What is the distribution of the classes?
- What are the characteristics of the data, in general and per class?

You can use methods like unique and describe.
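
The methods mentioned above could be combined roughly as follows. This is only a sketch on a handful of invented rows with assumed column names, not a full analysis of the iris data.

```python
import pandas as pd

# Invented rows; column names are hypothetical.
iris_toy = pd.DataFrame({
    "sepal_length": [5.0, 5.2, 6.0, 6.4, 7.0],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})

classes = iris_toy["species"].unique()             # distinct class labels
class_counts = iris_toy["species"].value_counts()  # rows per class
overall = iris_toy.describe()                      # summary of numeric columns
per_class = iris_toy.groupby("species")["sepal_length"].describe()  # same summary, per class

print(len(classes))
print(class_counts)
print(per_class)
```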


<hr style="border:2px solid black"> </hr>

## 4. Philadelphia bike share live data

4.1. Complete the code below to load a JSON live feed for a Philadelphia bike share program into a pandas data frame. It may help to look at the JSON data in a visual inspector. One way of doing this is to open the URL given below in Firefox. Once you have loaded the data, look at the head of the data frame, and list all of the columns.

(I've had to add in some header data to the request, as the server rejects all requests without a user agent string)

*You can use the pandas function json_normalize that was imported at the start of the notebook, but you need to pass it a suitable part of the JSON data.* 

*The indego_bikes_data object returned by requests.get() can be converted to a Python data structure using the json() method*

https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html

In [None]:
# The server rejects requests without a User-Agent header, so send a browser-like one.
indego_bikes_url = "https://www.rideindego.com/stations/json/"
headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0"}
indego_bikes_data = requests.get(indego_bikes_url, headers=headers)


In [None]:
# To display the data, you can use the following line
indego_bikes_data.json()
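
The hint about passing `json_normalize` a suitable part of the JSON can be illustrated on a small made-up payload. The keys used here (`features`, `properties`, `name`, `free_docks`) are assumptions for the sake of the example; the live feed may be organised differently, so inspect it first.

```python
import pandas as pd

# Made-up payload; the real feed's keys may differ.
payload = {
    "type": "FeatureCollection",
    "features": [
        {"properties": {"name": "Station A", "free_docks": 4}},
        {"properties": {"name": "Station B", "free_docks": 7}},
    ],
}

# Pass the list of records, not the whole dictionary.
# Nested keys are flattened into dotted column names.
stations = pd.json_normalize(payload["features"])
print(stations.columns.tolist())
```

Each element of the list becomes one row, which is why the outer dictionary is not passed directly.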


<hr style="border:2px solid black"> </hr>

4.2. Is there any street with more than one bike station?

<hr style="border:2px solid black"> </hr>

4.3. Use pandas to count the total number of available docks in each zip code, producing a Series of zipcodes and available dock counts.

*You can use the pandas method sum() on a groupby object to add all the values in a particular group*.
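
A minimal sketch of the tip above, using invented stations and assumed column names (`zip_code`, `docks_available`) that may not match the columns in the live feed:

```python
import pandas as pd

# Invented data; both column names are hypothetical.
stations = pd.DataFrame({
    "zip_code": ["19104", "19104", "19103"],
    "docks_available": [5, 3, 10],
})

# Selecting one column before sum() returns a Series indexed by zip code.
docks_by_zip = stations.groupby("zip_code")["docks_available"].sum()
print(docks_by_zip)
```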

<hr style="border:2px solid black"> </hr> 

__4.4*__. Using pandas, find the difference between the minimum and maximum number of available bikes at docks within each zip code.

__4.5*__. Write Python code using pandas to determine the postal code with the highest median of docks available.
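
For the two starred questions, one possible approach is to compute per-group statistics with `agg` or `median` and then combine or rank them. The sketch below runs on invented numbers with hypothetical column names.

```python
import pandas as pd

# Invented data; all column names are hypothetical.
stations = pd.DataFrame({
    "zip_code": ["19104", "19104", "19104", "19103", "19103"],
    "bikes_available": [2, 9, 4, 6, 7],
    "docks_available": [10, 3, 5, 8, 12],
})
grouped = stations.groupby("zip_code")

# Range of available bikes within each zip code: max minus min.
bike_stats = grouped["bikes_available"].agg(["min", "max"])
bike_range = bike_stats["max"] - bike_stats["min"]

# Zip code whose median number of available docks is largest.
median_docks = grouped["docks_available"].median()
best_zip = median_docks.idxmax()

print(bike_range)
print(best_zip)
```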

<hr style="border:2px solid black"> </hr> 

## 5. Bikes dataset

Load the bikes.csv file into a pandas data frame. Using the DataFrame method isnull(), you can produce a DataFrame where each value is True if the corresponding value is missing, or False if it is present.


5.1. Produce a count of the number of missing values in each column in the DataFrame.

*Tip - the sum() method treats True and False values as 1 and 0 respectively https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html*
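
The tip can be seen in action on a small invented frame (the column names are placeholders, not necessarily those in bikes.csv):

```python
import numpy as np
import pandas as pd

# Invented data with some gaps; column names are hypothetical.
rides = pd.DataFrame({
    "duration": [300, np.nan, 620],
    "gender": ["M", None, None],
})

# isnull() gives True/False per cell; sum() then adds them up column by column.
missing_per_column = rides.isnull().sum()
print(missing_per_column)
```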

<hr style="border:2px solid black"> </hr> 

5.2. Think of a sensible way of removing the missing values and use this to create a new DataFrame with no missing values. You can use the copy() method to duplicate a DataFrame before modifying it.
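
Two common options are dropping incomplete rows and filling gaps with a typical value. Which is sensible depends on the column, so the sketch below (invented data, hypothetical column names) is an illustration of the mechanics and not a recommendation.

```python
import numpy as np
import pandas as pd

# Invented data; column names are hypothetical.
rides = pd.DataFrame({
    "duration": [300.0, np.nan, 620.0],
    "birth_year": [1985.0, 1990.0, np.nan],
})

# Option 1: drop every row that has at least one missing value.
dropped = rides.dropna()

# Option 2: work on a copy and fill a numeric column with its median.
filled = rides.copy()
filled["duration"] = filled["duration"].fillna(filled["duration"].median())

print(len(dropped))
print(filled["duration"].tolist())
```

Because `filled` is a copy, the original `rides` frame keeps its missing values.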

<hr style="border:2px solid black"> </hr> 

5.3. Use the describe() method to calculate statistics of the columns in the data frame. Is there anything strange about the values in a column?

<hr style="border:2px solid black"> </hr> 

5.5. Convert the Start Time and End Time columns to pandas datetime objects. You can do this using the pd.to_datetime function on those columns.

Create a new column in the data frame that gives the day of the week the journey was started on. You can extract the day of the week from a datetime object using the .dayofweek attribute, and use the apply method of a column in a DataFrame to apply a function to each value in the column.
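
A short sketch of both steps on two invented timestamps. The `.dt` accessor is shown alongside `apply` as an alternative that should give the same result; the new column names are placeholders.

```python
import pandas as pd

# Invented start times stored as text, as they would be after read_csv.
rides = pd.DataFrame({"Start Time": ["2017-01-02 08:15:00", "2017-01-07 17:40:00"]})

# Convert the text column to datetime values.
rides["Start Time"] = pd.to_datetime(rides["Start Time"])

# apply() calls the function on each Timestamp; Monday is 0 and Sunday is 6.
rides["start_day"] = rides["Start Time"].apply(lambda ts: ts.dayofweek)

# The .dt accessor does the same for the whole column at once.
rides["start_day_dt"] = rides["Start Time"].dt.dayofweek

print(rides["start_day"].tolist())
```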

<hr style="border:2px solid black"> </hr> 

5.6. Create a new column in the data frame giving the (approximate) age in years of the user for each journey. 

<hr style="border:2px solid black"> </hr> 

5.7. Investigate the numbers of journeys starting at each hour of the day. You can use the hour attribute of a pandas datetime object to extract the hour of the day from the starting times.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.hour.html
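
As a rough sketch with three invented start times, the hour can be extracted with `.dt.hour` and tallied with `value_counts()`; sorting by index puts the hours in chronological order.

```python
import pandas as pd

# Invented start times.
starts = pd.Series(pd.to_datetime([
    "2017-01-02 08:15", "2017-01-02 08:50", "2017-01-02 17:05",
]))

# Extract the hour, count journeys per hour, then order by hour of day.
journeys_per_hour = starts.dt.hour.value_counts().sort_index()
print(journeys_per_hour)
```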

<hr style="border:2px solid black"> </hr> 

__5.8*__. Use the pandas cut() function to create a new column in the data frame that assigns an age range of the user for each journey. Use this new column to visualise the relationship between age group and duration of journeys.

You can find documentation on the cut function here - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
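
The sketch below shows how `cut()` might be used on invented ages. The bin edges, labels, and column names are arbitrary choices for illustration and not part of the exercise data.

```python
import pandas as pd

# Invented data; column names are hypothetical.
rides = pd.DataFrame({
    "age": [19, 34, 52, 67],
    "duration": [600, 900, 1200, 1500],
})

# Each bin includes its right edge by default: (0, 25], (25, 45], and so on.
bins = [0, 25, 45, 65, 120]
labels = ["<=25", "26-45", "46-65", "66+"]
rides["age_group"] = pd.cut(rides["age"], bins=bins, labels=labels)

# Mean duration per age group, keeping only groups that actually occur.
mean_duration = rides.groupby("age_group", observed=True)["duration"].mean()
print(mean_duration)
```

A Series like `mean_duration` can be passed to a bar plot, or the grouped durations to a box plot, to visualise the relationship.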

<hr style="border:2px solid black"> </hr> 

## NUMPY

1.1. In each of the locations indicated, write a line of code to test whether the NumPy array contains 0.

In [None]:
import numpy as np
x = np.array([1, 2, 3, 4])
print("Original array:")
print(x)
print("Test if none of the elements of the said array is zero:")
# insert your code here

x = np.array([0, 1, 2, 3])
print("Original array:")
print(x)
print("Test if none of the elements of the said array is zero:")
# insert your code here
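
Two possible tests are sketched below on the same two arrays: an explicit comparison with 0, and `np.all`, which relies on 0 being treated as False. Either should work for numeric arrays.

```python
import numpy as np

results = []
for values in ([1, 2, 3, 4], [0, 1, 2, 3]):
    arr = np.array(values)
    contains_zero = bool((arr == 0).any())  # True if at least one element equals 0
    none_zero = bool(np.all(arr))           # np.all treats 0 as False, so True means no zeros
    results.append((contains_zero, none_zero))
    print(values, contains_zero, none_zero)
```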

<hr style="border:2px solid black"> </hr> 

1.2. Insert code to test whether the array below contains any NaNs or infinite numbers.

In [None]:
a = np.array([1, 0, np.nan, np.inf])
print("Original array")
print(a)
print("Test a given array element-wise for finiteness :")
# insert your code here
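
NumPy provides element-wise checks for these special values. A brief sketch on the same array:

```python
import numpy as np

a = np.array([1, 0, np.nan, np.inf])

finite_mask = np.isfinite(a)       # False for both NaN and infinity
has_nan = bool(np.isnan(a).any())  # any NaN present?
has_inf = bool(np.isinf(a).any())  # any +inf or -inf present?

print(finite_mask)
print(has_nan, has_inf)
```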

<hr style="border:2px solid black"> </hr> 

1.3. Insert code to create a 3x3 identity matrix i.e. diagonal elements are 1, the rest are 0.
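
Two NumPy functions that build such a matrix are `np.eye` and `np.identity`; the sketch below checks that they agree.

```python
import numpy as np

identity_a = np.eye(3)       # ones on the main diagonal, zeros elsewhere
identity_b = np.identity(3)  # equivalent for square matrices

print(identity_a)
print(np.array_equal(identity_a, identity_b))
```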

<hr style="border:2px solid black"> </hr> 

1.4. Write code to generate an array of 10 random numbers from a normal distribution.
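
One way to do this is with NumPy's `Generator` interface. The seed, mean, and standard deviation below are arbitrary choices made so that the sketch is repeatable.

```python
import numpy as np

# A seeded generator gives the same numbers on every run (the seed is arbitrary).
rng = np.random.default_rng(seed=42)

# loc is the mean and scale is the standard deviation of the distribution.
samples = rng.normal(loc=0.0, scale=1.0, size=10)
print(samples)
```

`rng.standard_normal(10)` is a shorter alternative when the mean is 0 and the standard deviation is 1.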

<hr style="border:2px solid black"> </hr> 

__1.5*__. Write code to compute the coordinates for points on a cosine curve and plot the points using matplotlib __(Optional)__

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
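
The coordinates can be computed with NumPy alone, as in the sketch below. The x range of one full period and the number of points are arbitrary choices.

```python
import numpy as np

# 100 evenly spaced x values covering one full period (an arbitrary choice).
x = np.linspace(0, 2 * np.pi, 100)

# np.cos is applied to every element of x.
y = np.cos(x)

# Pair the coordinates up as rows of (x, y).
points = np.column_stack((x, y))
print(points[:3])
```

Passing `x` and `y` to `plt.plot` and then calling `plt.show()` should draw the curve.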

1.6. Create a 4x4 matrix with values ranging from 1 to 16.
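
A common pattern for this is to generate the values with `np.arange` and then reshape them, as sketched here:

```python
import numpy as np

# arange excludes the stop value, so 17 is needed to include 16.
matrix = np.arange(1, 17).reshape(4, 4)
print(matrix)
```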