# Lab 1

In [None]:
import pandas as pd
import numpy as np

## Part 1. Practicing NumPy

### a. Define a new Python *list*

Let us practice defining a new list in Python:

In [None]:
my_list = [0,0,1,2,3,3,4.5,7.6]
my_list

### b. (Two ways to) define a *range*

One common kind of sequence is a range of (e.g., integer) numbers. Ranges are useful for iterating over in a loop -- that is, for assigning some variable each value in the sequence in turn. (For example, `i=1`, then `i=2`, then `i=3`, etc., until `i=100`.) <br>

Let's create an evenly spaced array of integers ranging from 0 to 11:

In [None]:
#First, the basic python way:
my_range = range(0,12)
my_range

In [None]:
#Now the numpy way:
my_range_np = np.arange(12)
my_range_np

You can see that the first method returns a special 'range' object, while the latter returns an object of type 'numpy array'. If we convert both to lists, they will be the same:

In [None]:
list(my_range)

In [None]:
list(my_range_np)

### c. List comprehension

Consider the task of replacing each value in a list with its square. The traditional way of performing the same transformation on every element of a list is via a *for* loop:

In [None]:
my_range_np = np.arange(12)

for i in range(0, len(my_range_np)):
    my_range_np[i] = my_range_np[i]**2 #square each element
print(my_range_np)

That worked. However, there is a better, more 'Pythonic' way to do it. <br>

*List comprehension* is one of the most elegant functionalities native to Python, and is beloved by programmers like you. It offers a concise way of applying a particular transformation to every element in a list. <br>

Using list comprehension syntax, we can write a single, easily interpretable line that performs the same transformation without using a range and without introducing an index variable `i`:

In [None]:
my_range_np = np.arange(12)

my_range_np = [x**2 for x in my_range_np]
my_range_np
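
Because the data here started out as a NumPy array, a third option is worth knowing: NumPy applies arithmetic operators element by element, so neither a loop nor a comprehension is needed. A minimal sketch follows (the names `values`, `squares_list`, and `squares_np` are illustrative only and do not appear elsewhere in this lab):

```python
import numpy as np

values = np.arange(12)

# List comprehension: the result is a plain Python list
squares_list = [x**2 for x in values]

# Vectorized arithmetic: the result stays a NumPy array
squares_np = values**2

print(type(squares_list).__name__)  # list
print(type(squares_np).__name__)    # ndarray
```

Both hold the same numbers. The difference is the container type, which matters if later cells expect array behavior such as `np.shape`.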

### d. Creating a one-dimensional NumPy *array*

Let's explicitly create a one-dimensional numpy array (as opposed to a list):

In [None]:
arr = np.array([1, 2, 3, 4])
arr

### e. Retrieving the dimensions of a data structure: len() and np.shape()

How would we go about creating a variable that contains the length of our array `arr` ?<br>
We could use the python function ```len()```... 

In [None]:
arr_length = len(arr)
arr_length

... or use the NumPy function ```np.shape```, which returns a tuple holding the size of each dimension, and keep only its first entry:

In [None]:
arr_length = np.shape(arr)[0]
arr_length

(Tip: try removing the index `[0]` and see how the printout changes. For a one-dimensional array, `np.shape` returns the one-element tuple `(4,)`. The trailing comma is simply how Python writes a tuple with a single entry; it is not an empty 'slot' for a second number.)
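
To make the tuple returned by `np.shape` concrete, here is a small sketch comparing a one-dimensional and a two-dimensional array (the names `vec` and `mat` are illustrative only):

```python
import numpy as np

vec = np.array([1, 2, 3, 4])             # one dimension
mat = np.array([[1, 2, 3], [4, 5, 6]])   # two dimensions: 2 rows, 3 columns

print(np.shape(vec))  # (4,)
print(np.shape(mat))  # (2, 3)
print(len(mat))       # 2
```

Note that `len()` only reports the size of the first dimension (the number of rows for a 2D array), whereas `np.shape` reports every dimension.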

### f. Creating a uniform (same value in every position) array: np.ones()

We will now use ```np.ones``` to create an array of a pre-specified length that contains the value '1' in each position:

In [None]:
length = 55
np.ones(length, dtype=int)

We can use this method to create an array of any identical values. <br>
Let's create an array of length 13, filled with the value '7' in every position:

In [None]:
7*np.ones(13, dtype=int)
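
As an aside, NumPy also provides `np.full`, which builds a constant array directly from a shape and a fill value. The sketch below checks that both approaches give the same result (the variable names are illustrative only):

```python
import numpy as np

# Scale an array of ones, as in the cell above
sevens_scaled = 7 * np.ones(13, dtype=int)

# Ask for the fill value directly
sevens_full = np.full(13, 7)

print(np.array_equal(sevens_scaled, sevens_full))  # True
```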

### g. Creating two-dimensional arrays

Exploring the possibilities of the ```np.array``` method further, let's move on to creating not one- but two-dimensional arrays (aka matrices):

In [None]:
M = np.array([[1,2,3], [4,5,6]])
M

In [None]:
np.shape(M)

Numpy contains useful methods for creating identity matrices of a specified size:

### h. Creating an identity matrix: np.eye()

In [None]:
np.eye(5)

In [None]:
#check your intuition: what will be the printout after running this cell?
A = np.eye(3)
B = 4*np.eye(3)
A+B

### i. A small challenge:  matrix transformation; random matrix generation
Using the matrix ```M``` below and the function ```np.tril``` (pull up the documentation by typing ```np.tril?```), create a matrix ```N``` which is identical to ```M``` except in the lower triangle (i.e., all the cells below the diagonal). The lower triangle should be filled with zeros.

In [None]:
np.tril?

In [None]:
M = np.round(np.random.rand(5,5),2)
print("M=\n", M)

# your code here:
N = 


Using the code provided above for generating the matrix ```M```, try to figure out how to create a random matrix with 13 rows and 3 columns. <br>

In [None]:
# your code here:
M = 


### j. Indexing arrays

Here is how to call an element of a 2D array by its location (i.e., its row index and column index):

In [None]:
M[3,2]

In [None]:
# test your intuition: what would you expect this code to return?
M[3:,2]
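
Since `M` is random, it can help to try the same indexing patterns on an array whose contents are known in advance. The sketch below uses a hypothetical 4x3 array named `grid` (not part of this lab) filled with the integers 0 to 11:

```python
import numpy as np

grid = np.arange(12).reshape(4, 3)
# grid looks like:
# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]]

single = grid[3, 2]       # row 3, column 2 -> 11
tail = grid[2:, 1]        # rows 2 onward, column 1 -> [ 7 10]
first_col = grid[:, 0]    # every row, column 0 -> [0 3 6 9]

print(single, tail, first_col)
```

Remember that indices start at 0, so `grid[3, 2]` refers to the fourth row and the third column.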

### k. Evaluating a Boolean condition

In real-life data tasks, you will often have to compute the boolean ```(True/False)``` value of some statement, for all entries in a list, or for a matrix column (essentially, a list), or for the entire matrix. <br>
In other words, we may want to formulate a condition -- think of it as a *test* -- and run a computation that returns `True` or `False` depending on whether the test is passed or failed by a particular value in a data structure.

For example, our test may be something like "the value is greater than 0.5", and we would like to know if this is true or false for each of the values in a list. Here's how to compute a list of values that the condition test takes for each value of a given list:

In [None]:
g = np.random.rand(1, 20) #first, create the list 
print(g)

In [None]:
is_greater = g>0.5
print(is_greater)

In [None]:
# Let's print the matrix M again so we can glance at it for our next exercise
print(M)

In [None]:
# What would you expect to see once you run the code below?
c_is_greater = M[:,1]>0.5
c_is_greater
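
A Boolean array like the ones above is useful beyond just looking at it: it can be counted and used to select values. The following is a minimal sketch on a small hand-written array (the names `scores`, `mask`, `n_passing`, and `passing` are illustrative only):

```python
import numpy as np

scores = np.array([0.2, 0.9, 0.5, 0.7, 0.1])

mask = scores > 0.5       # [False  True False  True False]
n_passing = mask.sum()    # True counts as 1, so this counts the passes
passing = scores[mask]    # keep only the values where the test is True

print(mask)
print(n_passing)  # 2
print(passing)    # [0.9 0.7]
```

Note that `0.5 > 0.5` is `False`, because the comparison is strict.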

### l. NumPy functions `np.any`, `np.unique`

We can use ```np.any``` to determine if there is any entry in column 1 that is smaller than 0.1:

In [None]:
c_is_smaller = M[:,1]<0.1
np.any(c_is_smaller)
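
`np.any` has a companion, `np.all`, which asks whether the test passes for *every* entry rather than for at least one. A small sketch on a hand-written column (the names below are illustrative only):

```python
import numpy as np

column = np.array([0.35, 0.08, 0.92, 0.51])

has_small = np.any(column < 0.1)     # True: 0.08 is below 0.1
all_small = np.all(column < 0.1)     # False: most entries are not
all_below_one = np.all(column < 1)   # True: every entry is below 1

print(has_small, all_small, all_below_one)
```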

A small challenge:<br>
You have birthday data for a cohort of 100 people all born in 1990.
Given the one-dimensional array of birthdays ```random_bdays``` generated below, figure out if there exists a pair of people who share a birthday.<br>
(Tip: you may find the function ```np.unique()``` useful. Feel free to read up on it by typing `np.unique?` in a new cell.)

In [None]:
# do not edit this code:
random_nums = np.random.choice(365, size = 100)
random_bdays = np.datetime64('1990-01-01') + random_nums

## your code here:
duplicates_exist = 


## Part 2. Pandas DataFrames

### a. Creating a DataFrame: two (of the many) ways

We will go over two ways of creating Pandas dataframes from scratch: from a *list of lists*, and from a *dictionary*.

In [None]:
my_list = [['+1', '(929)-000-0000'], ['+34', '(917)-000-0000'], ['+7', '(470)-000-0000']]
df = pd.DataFrame(my_list, columns = ['country_code', 'phone'])
df

In [None]:
my_dict = {'country_code': ['+1', '+34', '+7'], 'phone':['(929)-000-0000', '(917)-000-0000', '(470)-000-0000']}
df = pd.DataFrame(my_dict)
df

### b. Adding a column to a DataFrame object

Below, we add a new column of values to the dataframe:

In [None]:
df['grade']= ['A','B','A']
df

### c. Sorting the dataframe by values in a specific column: `df.sort_values`

In [None]:
df.sort_values(['grade'], axis = 0)
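
`sort_values` accepts a few options that come up often: `ascending` to reverse the order, and a list of columns to break ties. The sketch below uses a hypothetical `students` dataframe (not part of this lab) to illustrate both:

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Di"],
    "grade": ["B", "A", "B", "A"],
    "n_credits": [9, 12, 3, 6],
})

# Reverse alphabetical order of grade
by_grade_desc = students.sort_values("grade", ascending=False)

# Sort by grade first; within the same grade, most credits first
by_grade_then_credits = students.sort_values(
    ["grade", "n_credits"], ascending=[True, False]
)

print(by_grade_then_credits)
print(students)  # the original dataframe is left unchanged
```

`sort_values` returns a new, sorted dataframe. To keep the sorted version, assign the result to a variable.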

In real life settings, you will often need to combine separate sets of related data.<br>
To illustrate, let's create a second dataframe:

### d. Combining multiple dataframes: `pd.concat` and `df.merge`. Renaming the columns: `df.rename()`

In [None]:
my_dict2 = {'country': ['+32', '+81', '+11'], 'grade':['B', 'B+', 'A'], 'phone':['(874)-444-0000', '(313)-003-1000', '(990)-006-0660']}
df2 = pd.DataFrame(my_dict2)
df2

Use the Pandas ```pd.concat``` method to append the second dataframe to the first one:

In [None]:
pd.concat([df,df2])

One immediate problem with this result is that the index has repeated values. This defeats the purpose of an index and ought to be fixed. Let's try the concatenation again, this time adding the `reset_index()` method to renumber the rows:

In [None]:
pd.concat([df,df2]).reset_index()

Much better! <br>
This result is valid, but it contains redundancy caused by different spelling of the country code column in the two dataframes. This can be easily fixed:

In [None]:
df2 = df2.rename(columns={'country':'country_code'})

pd.concat([df,df2]) # aha!
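
A note on the index: called with default arguments, `reset_index()` renumbers the rows but also keeps the old index as an extra column named `index`. If that column is not wanted, `pd.concat` can renumber the rows itself via `ignore_index=True`. A minimal sketch with two small stand-in dataframes (`first` and `second` are illustrative names, not the lab's `df` and `df2`):

```python
import pandas as pd

first = pd.DataFrame({"country_code": ["+1", "+34"], "grade": ["A", "B"]})
second = pd.DataFrame({"country_code": ["+32", "+81"], "grade": ["B", "B+"]})

# Renumber rows during concatenation; no extra column is created
stacked = pd.concat([first, second], ignore_index=True)
print(list(stacked.index))     # [0, 1, 2, 3]

# For comparison: reset_index() keeps the old labels in an 'index' column
with_old = pd.concat([first, second]).reset_index()
print(list(with_old.columns))  # ['index', 'country_code', 'grade']
```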

What if our task was to merge ```df2``` with yet another dataset -- one that contains additional unique columns?

In [None]:
df2

In [None]:
my_dict3 = {'country_code': ['+32', '+44', '+11'], 'phone':['(874)-444-0000', '(575)-755-1000', '(990)-006-0660'], 'grade':['B', 'B+', 'A'], 'n_credits': [12, 3, 9]}
df3 = pd.DataFrame(my_dict3)
df3

Feel free to consult the definition of the Pandas ```merge``` method to better understand the next cell:

In [None]:
df2.merge(df3, on = 'phone')
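
By default, `merge` performs an *inner* join: only rows whose key appears in both dataframes are kept. The `how` parameter changes this. The sketch below uses two small hypothetical dataframes (`left` and `right`, with made-up phone keys) to contrast an inner and an outer join:

```python
import pandas as pd

left = pd.DataFrame({"phone": ["111", "222", "333"],
                     "grade": ["B", "B+", "A"]})
right = pd.DataFrame({"phone": ["111", "333", "444"],
                      "n_credits": [12, 9, 3]})

# Default (how='inner'): keep only phones present in both dataframes
inner = left.merge(right, on="phone")

# how='outer': keep every phone; missing values are filled with NaN
outer = left.merge(right, on="phone", how="outer")

print(inner)
print(outer)
```

When the two dataframes share column names other than the key, Pandas keeps both copies and, by default, appends the suffixes `_x` and `_y` to tell them apart.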

### e. Loading a dataset: `pd.read_csv`

We are now well equipped to deal with a real dataset!<br>


For the next few exercises, our dataset will contain information about New York City listings on the Airbnb platform. 

In [None]:
import os 
filename = os.path.join(os.getcwd(), "labs_data", "airbnb_lab1.csv.gz") # path to file and file name
data = pd.read_csv(filename)

In [None]:
data.shape

First, get a peek at the data:

In [None]:
data.head()

That's a lot of columns, and the layout is a little difficult to read. <br>
Let us retrieve just the list of column names, so we can read it and get a feeling for what kind of information is presented in the dataset.

### f. Get column names: `df.columns`

In [None]:
list(data.columns)

What do the column names mean? Some of them are less intuitively interpretable than others. <br>
Careful data documentation is indispensable for business analytics. Make sure to consult the documentation that accompanies this open source dataset for a detailed description of the key variable names, what they represent, and how they were generated:  https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896 

### g. Summary statistics of the dataset: `df.describe()`

Let's print some general statistics for each one of the data columns:

In [None]:
data.describe(include='all')

<br>
<br>Consider the following business question:<br>
What is the average availability (out of 365 days in a year) for the listings in Brooklyn? <br>
The answer can be obtained by the use of **filters** on the dataset.

### h. Filtering the data: `df[ < condition > ]`

We need to select the entries that are in Brooklyn. To do this, we need to know exactly how Brooklyn listings are spelled and entered in the data. Let's print all of the unique values of the 'neighbourhood' column:

In [None]:
data['neighbourhood'].unique()

You may have noticed that there is a lot of heterogeneity in the way 'neighbourhood' values are specified. The values are not standardized. There are overlaps, redundancies, and inconsistencies (e.g., some entries specify ```'Greenpoint, Brooklyn, New York, United States'```, others list `'BROOKLYN, New York, United States'`, and yet others say `'Williamsburg, Brooklyn, New York, United States'`, etc.). In real life, you would have to clean this data and replace these values with standard, identically formatted, consistent values. <br>

For this data file, however, we are lucky to already have a 'cleansed' version of the neighborhood information based on the latitude and the longitude of every listing location. 

We will list the unique values of the columns titled 'neighbourhood_cleansed' and 'neighbourhood_group_cleansed':

In [None]:
data['neighbourhood_cleansed'].unique()

In [None]:
data['neighbourhood_group_cleansed'].unique()

Great, this last one is what we want! Let's keep only the data entries that pertain to Brooklyn listings:

In [None]:
bk = data[data['neighbourhood_group_cleansed'] == 'Brooklyn']
bk.shape

(Tip: to better understand what happened above, you are encouraged to insert a new code cell below and copy *just the condition* of the filter that we used on the `data` object above -- that is, everything that we specified inside the brackets of the outermost data[...]. Run the new cell and see what that condition alone evaluates to: you should see a series of True/False values. When we pass that series to `data` as a Boolean filter by writing<br> `data[ < our Boolean series > ]`,<br> we tell Pandas to keep only those rows of `data` for which the condition evaluated to `True`. Don't worry! You'll get the hang of this!)
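
Boolean filters can also be combined. The sketch below uses a tiny hypothetical dataframe (`listings`, with a made-up `borough` column rather than the lab's real column names) to show a condition on its own, and then two conditions joined with `&` (element-wise "and"):

```python
import pandas as pd

listings = pd.DataFrame({
    "borough": ["Brooklyn", "Manhattan", "Brooklyn", "Queens"],
    "availability_365": [120, 300, 10, 365],
})

# Each condition on its own is a series of True/False values
in_brooklyn = listings["borough"] == "Brooklyn"
often_available = listings["availability_365"] > 100
print(in_brooklyn.tolist())  # [True, False, True, False]

# Keep only rows where BOTH conditions are True
selected = listings[in_brooklyn & often_available]
print(selected)
```

When writing the conditions inline, wrap each one in parentheses, e.g. `listings[(listings["borough"] == "Brooklyn") & (listings["availability_365"] > 100)]`, because `&` binds more tightly than `==` and `>`.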

### i. Combining values in a column: `np.mean()`

Now that we have isolated only the relevant entries, it remains to average the values of the particular column that we care about:

In [None]:
np.mean(bk['availability_365'])
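
One thing to keep in mind when averaging real data is missing values. The sketch below uses a small hand-made series (not the Airbnb data, and no claim is made here about whether that column has gaps) to show how the Pandas method `.mean()` and the NumPy functions treat `NaN` differently:

```python
import numpy as np
import pandas as pd

values = pd.Series([100.0, 200.0, np.nan, 300.0])

# The Pandas method skips missing values by default
series_mean = values.mean()                  # 200.0

# np.mean on the raw NumPy array propagates NaN
array_mean = np.mean(values.to_numpy())      # nan

# np.nanmean ignores NaN explicitly
nan_aware = np.nanmean(values.to_numpy())    # 200.0

print(series_mean, array_mean, nan_aware)
```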

### j. Group data by (categorical) column values: `df.groupby`

The next question of interest could be:<br>
What are the top 5 most reviewed neighborhoods in New York? (By sheer number of reviews, regardless of their quality). <br>
We will use the ```.groupby``` method from the Pandas package:

In [None]:
nbhd_reviews = data.groupby('neighbourhood_cleansed')['number_of_reviews'].sum()
nbhd_reviews.head()

Perform a (descending order) sorting on this series:

In [None]:
nbhd_reviews = nbhd_reviews.sort_values(ascending = False)
nbhd_reviews.head(5)

Success!<br>
While we're at it, what are the least reviewed neighborhoods?

In [None]:
nbhd_reviews.tail(5)

This result makes it apparent that our dataset is somewhat messy!

Notice we could have chained the transformations above into a single command, as in:

In [None]:
data.groupby('neighbourhood_cleansed')['number_of_reviews'].sum().sort_values(ascending = False).head(5)

This way we don't store objects that we won't need.
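
To see what each step of the chain produces, it can help to run it on a toy dataset small enough to check by hand. The sketch below uses a hypothetical `reviews` dataframe (made-up neighbourhood labels `A`, `B`, `C`) and also shows `nlargest`, a Series method that combines the sort and the `head` into one call:

```python
import pandas as pd

reviews = pd.DataFrame({
    "neighbourhood": ["A", "B", "A", "C", "B", "A"],
    "number_of_reviews": [10, 5, 20, 1, 7, 3],
})

# Step 1: one total per neighbourhood (A: 33, B: 12, C: 1)
totals = reviews.groupby("neighbourhood")["number_of_reviews"].sum()

# Step 2: sort from largest to smallest and keep the top 2
top2 = totals.sort_values(ascending=False).head(2)

# Alternative to step 2: ask for the 2 largest values directly
top2_alt = totals.nlargest(2)

print(top2)
print(top2_alt)
```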

### Bonus: easy histogram plotting with Matplotlib via Pandas: `Series.hist`

As a final touch, run the cell below to instantly visualize the distribution of total (not average!) review counts across all neighbourhoods:

In [None]:
%matplotlib inline
nbhd_reviews.hist()

This plot suggests that the vast majority of neighborhoods have very few reviews, with just a handful of outliers (those ranked at the top in our previously computed cell) having upward of 40000 reviews.