# Session 2 - Introduction to Numpy/Pandas

First, we begin by importing the libraries. Each is typically imported under an abbreviation that makes it easier to call its functions afterwards.

1. **NumPy:** Python library that is used for efficient numerical computations on large datasets. These computations are done by using special functions and operations that are optimized on the NumPy arrays. 

2. **Pandas:** Python library meant for data analysis and manipulation. The framework in this library is built on top of NumPy and uses the arrays for building series or dataframes. 

In [1]:
import pandas as pd
import numpy as np

## NumPy Arrays

NumPy arrays are more flexible than lists. They can be one-dimensional like a list (a single row of values), or they can have several rows and take the shape of a matrix.

In [2]:
lst = [1,2,3,4]

In [3]:
# One-dimensional array
array1 = np.array(lst)
array1

array([1, 2, 3, 4])

In [4]:
# Two-dimensional array
array2 = np.array([lst, lst])
array2

array([[1, 2, 3, 4],
       [1, 2, 3, 4]])

And so on.. 
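One practical difference from lists is that arithmetic on an array is applied element by element. The following is a minimal sketch reusing the values of `lst` from above; the names `repeated_list` and `doubled_array` are only for illustration.

```python
import numpy as np

lst = [1, 2, 3, 4]
arr = np.array(lst)

# Multiplying a list repeats it; multiplying an array scales each element
repeated_list = lst * 2      # [1, 2, 3, 4, 1, 2, 3, 4]
doubled_array = arr * 2      # array([2, 4, 6, 8])

# The shape attribute reports (rows, columns) for a two-dimensional array
array2 = np.array([lst, lst])

print(repeated_list)
print(doubled_array)
print(array2.shape)          # (2, 4)
```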

We can also reshape our initial array into a different shape, as long as the new shape holds the same total number of elements.

In [5]:
# We can reshape to a 2x2 matrix
array1.reshape((2,2))

array([[1, 2],
       [3, 4]])
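As a rough illustration of which reshapes are allowed, the sketch below tries a few target shapes for the same four values. The names `column`, `inferred` and `reshape_failed` are made up for this example.

```python
import numpy as np

array1 = np.array([1, 2, 3, 4])

# 4 elements fit a 4x1 or 2x2 shape; -1 asks NumPy to infer that dimension
column = array1.reshape((4, 1))
inferred = array1.reshape((2, -1))
print(column.shape)      # (4, 1)
print(inferred.shape)    # (2, 2)

# 4 elements cannot fill a 3x2 shape (6 slots), so NumPy raises a ValueError
try:
    array1.reshape((3, 2))
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```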

In [6]:
# Check dimensions
array2.ndim

2

### Indexing

The syntax of indexing is the same as for lists.

In [7]:
array1[0]

1
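For two-dimensional arrays, indexing takes a row and a column separated by a comma. This is a small sketch; the second row of `array2` is multiplied by 10 here (unlike the example above) only so that the rows are easy to tell apart.

```python
import numpy as np

array1 = np.array([1, 2, 3, 4])
array2 = np.array([array1, array1 * 10])   # rows: [1 2 3 4] and [10 20 30 40]

last_value = array1[-1]        # negative indices count from the end -> 4
first_two = array1[:2]         # slicing works as with lists -> array([1, 2])
element = array2[1, 2]         # row 1, column 2 -> 30
second_column = array2[:, 1]   # every row, column 1 -> array([ 2, 20])
```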

### Data Types

All elements of an array share one data type, and that type can be one of many (integers, floats, strings, and so on). You can check the data type with the `dtype` attribute.

In [8]:
array1.dtype

dtype('int64')

Hence, our array is made up of integers. Similarly to how we can change data types of variables, we can change datatypes of arrays using `astype`.

In [9]:
array1.astype('str')

array(['1', '2', '3', '4'], dtype='<U21')
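Note that `astype` returns a new array and leaves the original unchanged, so the result has to be assigned to a name to keep it. A short sketch, with `as_float` and `array1_str` as illustrative names:

```python
import numpy as np

array1 = np.array([1, 2, 3, 4])

as_float = array1.astype(float)
print(as_float.dtype)    # float64
print(array1.dtype)      # still an integer dtype; the original is untouched

# Keep the converted array by assigning it to a name
array1_str = array1.astype(str)
joined = array1_str[0] + array1_str[1]   # string concatenation, not addition
print(joined)            # 12
```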

The rest of this tutorial focuses on how to use pandas, which uses the tools from NumPy to analyze large datasets.

We will begin with an introduction to pandas and then go through the remainder of the lesson focused on applying functions to data from Yahoo Finance for practice.

## Pandas

Pandas is the go-to tool for cleaning, transforming and analysing data. It will allow you to create datasets, read in datasets, manipulate them, calculate summary statistics, remove missing values, etc.

We started with NumPy because Pandas is built on top of the NumPy package so the structure and functions are very similar.

The two main components of pandas are **Series** and **DataFrames**. Many functions that apply to a Series carry over to DataFrames with similar syntax.

### Series

A **Series** is a list of values that acts very similarly to a NumPy array. This array is displayed as a single column.

You can let pandas create a default index or specify an index. There are many ways to create a Series.

In [10]:
# Default integer index
pd.Series(array1)

0    1
1    2
2    3
3    4
dtype: int64

In [11]:
# Set an index
pd.Series(array1, index=['a', 'b', 'c', 'd'])

a    1
b    2
c    3
d    4
dtype: int64

In [12]:
# Create series from dictionary
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

pd.Series(d)

a    1
b    2
c    3
d    4
dtype: int64

In [13]:
# Check data type
pd.Series(d).dtype

dtype('int64')
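Once a Series has an index, values can be looked up by label as well as by position. The sketch below uses the same dictionary as above; the variable names are only for illustration.

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

by_label = s['b']          # look up by index label -> 2
by_position = s.iloc[0]    # look up by integer position -> 1
subset = s[['a', 'd']]     # a list of labels returns a smaller Series
doubled = s * 2            # arithmetic is element-wise, as with NumPy arrays
```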

### DataFrame

A **DataFrame**, on the other hand, is a collection of these Series displayed as a two-dimensional table, with each Series forming one column.

#### Creating DataFrame

In [14]:
# Create a DataFrame from a dictionary
data = {'student_id': [27139,39978,35631,98632,80272,57815,37820,19711,21270,23647],
        'assignment1': [1, 0.85, 0.65, 0.80, 0.75, 0.60, 0.75, 0.92, 0.97, 0.87]
       }

grades = pd.DataFrame(data)

In [15]:
# Create a DataFrame from two Series 
student_id = pd.Series([27139,39978,35631,98632,80272,57815,37820,19711,21270,23647])
a1 = pd.Series([1, 0.85, 0.65, 0.80, 0.75, 0.60, 0.75, 0.92, 0.97, 0.87])

grades2 = pd.DataFrame({'student_id': student_id, 
                        'assignment1': a1})

There are many other ways to create a DataFrame, if you are interested.
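Two of those other ways are sketched here, using the first two students only: building a DataFrame from a list of dictionaries (one per row) and from a two-dimensional NumPy array. The names `rows`, `from_rows`, `values` and `from_array` are illustrative.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: one dictionary per row
rows = [
    {'student_id': 27139, 'assignment1': 1.00},
    {'student_id': 39978, 'assignment1': 0.85},
]
from_rows = pd.DataFrame(rows)

# From a 2-D NumPy array, supplying the column names
values = np.array([[27139, 1.00], [39978, 0.85]])
from_array = pd.DataFrame(values, columns=['student_id', 'assignment1'])

print(from_rows.dtypes)    # student_id stays an integer column
print(from_array.dtypes)   # both columns are float64: an array has one dtype
```

The array route converts the IDs to floats, which is one reason the dictionary-based constructors are often more convenient for mixed data.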

#### Displaying Values

In [16]:
# Display first five values of DataFrame
grades.head()

   student_id  assignment1
0       27139         1.00
1       39978         0.85
2       35631         0.65
3       98632         0.80
4       80272         0.75


In [17]:
# Display last five values of DataFrame 
grades2.tail()

   student_id  assignment1
5       57815         0.60
6       37820         0.75
7       19711         0.92
8       21270         0.97
9       23647         0.87


#### Working with columns

In [18]:
# What columns do we have?
grades.columns

Index(['student_id', 'assignment1'], dtype='object')

In [19]:
# Use the student_id Series created earlier as the index
grades.set_index(student_id, inplace=True)

# Drop the now-redundant student_id column
grades.drop(columns=['student_id'], inplace=True)

In [20]:
grades.head()

       assignment1
27139         1.00
39978         0.85
35631         0.65
98632         0.80
80272         0.75


In [21]:
# Extract a column
grades['assignment1']

27139    1.00
39978    0.85
35631    0.65
98632    0.80
80272    0.75
57815    0.60
37820    0.75
19711    0.92
21270    0.97
23647    0.87
Name: assignment1, dtype: float64

In [22]:
grades.assignment1

27139    1.00
39978    0.85
35631    0.65
98632    0.80
80272    0.75
57815    0.60
37820    0.75
19711    0.92
21270    0.97
23647    0.87
Name: assignment1, dtype: float64

If there are spaces in your column name, then use the `df['column name']` method. If there are no spaces, then you can also use the `df.column_name` method to extract a column, provided the name does not clash with an existing DataFrame attribute or method.
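To make the difference between the two access styles concrete, here is a small sketch with a hypothetical DataFrame `df` whose column names (`final grade` and `count`) are chosen to cause trouble for attribute access.

```python
import pandas as pd

df = pd.DataFrame({'final grade': [0.9, 0.8], 'count': [1, 2]})

# A name containing a space only works with brackets
final = df['final grade']

# 'count' is also a DataFrame method, so df.count is the method, not the column
count_column = df['count']
print(callable(df.count))   # True
```

Bracket access works for every column name, so it is the safer default.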

In [23]:
# Extract a column's values
grades['assignment1'].values

array([1.  , 0.85, 0.65, 0.8 , 0.75, 0.6 , 0.75, 0.92, 0.97, 0.87])

Notice that when we extract the data values, we have a NumPy array!

Next, let's look at adding a new column. Let's say you want to add another assignment to the grades for these 10 students.

In [24]:
# Extract index
grades.index

Int64Index([27139, 39978, 35631, 98632, 80272, 57815, 37820, 19711, 21270,
            23647],
           dtype='int64')

In [25]:
# Create a new column
assignment2 = pd.Series([0.95, 0.80, 0.78, 0.75, 0.90, 0.85, 0.77, 0.94, 0.88, 0.90], index=grades.index)

grades['assignment2'] = assignment2

grades.head()

       assignment1  assignment2
27139         1.00         0.95
39978         0.85         0.80
35631         0.65         0.78
98632         0.80         0.75
80272         0.75         0.90


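Now we have two assignments in our data. Note that pandas aligns a new Series to the DataFrame by index label rather than by position, so the index needs to match; this is why the new Series was created with `index=grades.index`. Labels that do not match produce missing values (`NaN`).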
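The effect of the index on column assignment can be seen in a small sketch with three of the students. The column name `unaligned` and the Series `shuffled` are made up for this illustration.

```python
import pandas as pd

grades = pd.DataFrame({'assignment1': [1.0, 0.85, 0.65]},
                      index=[27139, 39978, 35631])

# Default 0, 1, 2 index: no label matches a student ID, so every value is NaN
unaligned = pd.Series([0.95, 0.80, 0.78])
grades['unaligned'] = unaligned

# Same labels in a different order: values are placed by label, not position
shuffled = pd.Series([0.78, 0.95, 0.80], index=[35631, 27139, 39978])
grades['assignment2'] = shuffled

print(grades)
```

A plain list or NumPy array has no index, so it would be assigned by position instead.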

#### Transforming Data

Next, for transforming data, you can apply simple mathematical operations to your columns, i.e. you can multiply, add, subtract, or divide an entire column by a value or by another column.

In [26]:
# Increase from assignment 1 to assignment 2 
grades['diff'] = grades['assignment2'] - grades['assignment1']

grades['diff']

27139   -0.05
39978   -0.05
35631    0.13
98632   -0.05
80272    0.15
57815    0.25
37820    0.02
19711    0.02
21270   -0.09
23647    0.03
Name: diff, dtype: float64

In [27]:
# Multiply all assignment grades by 100 to get them in percentages
grades['assignment1'] *= 100
grades['assignment2'] *= 100
grades['diff'] *= 100

An alternative would have been to write `grades * 100`, which multiplies every column by 100 at once.
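As a sketch of that alternative, the example below scales a small two-student DataFrame all at once, and then only a chosen list of columns. The names `all_scaled`, `some_scaled` and `cols` are illustrative.

```python
import pandas as pd

grades = pd.DataFrame({'assignment1': [1.0, 0.85],
                       'assignment2': [0.95, 0.80]},
                      index=[27139, 39978])

# Multiply every column at once; this returns a new DataFrame
all_scaled = grades * 100

# Multiply only the chosen columns, working on a copy
cols = ['assignment1']
some_scaled = grades.copy()
some_scaled[cols] = some_scaled[cols] * 100

print(all_scaled)
print(some_scaled)
```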

In [28]:
grades.head()

       assignment1  assignment2  diff
27139        100.0         95.0  -5.0
39978         85.0         80.0  -5.0
35631         65.0         78.0  13.0
98632         80.0         75.0  -5.0
80272         75.0         90.0  15.0


#### Selecting Data

Now for selecting data, the best tools with DataFrames are `.loc` and `.iloc`. Use `:` in place of the rows or the columns if you want to return all of them.

**`.loc`**
* Label-based
* Specify rows and columns by their row and column labels

**`.iloc`**
* Integer-based
* Specify rows and columns based on their integer position (starts at 0)

To understand this better, let's say you wanted to get the assignment 2 grade for the student with ID number 98632, who is in the fourth row (integer position 3).

In [29]:
grades.loc[98632, 'assignment2']

75.0

In [30]:
# Equivalent iloc statement
grades.iloc[3, 1]

75.0

In [31]:
# Use .loc with a column and slicing
grades['assignment1'].loc[27139:98632]

27139    100.0
39978     85.0
35631     65.0
98632     80.0
Name: assignment1, dtype: float64
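Notice that the label slice above includes the end label 98632. Slicing with `.iloc` follows the usual Python rule and excludes the end position. A brief sketch with the first five students (variable names are illustrative):

```python
import pandas as pd

grades = pd.DataFrame({'assignment1': [100.0, 85.0, 65.0, 80.0, 75.0]},
                      index=[27139, 39978, 35631, 98632, 80272])

# .loc slices by label and includes both endpoints -> 4 rows
by_label = grades.loc[27139:98632, 'assignment1']

# .iloc slices by position and excludes the end -> 3 rows
by_position = grades.iloc[0:3, 0]

print(len(by_label), len(by_position))   # 4 3
```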

In [32]:
# Get all rows
grades.loc[:, 'assignment1']

27139    100.0
39978     85.0
35631     65.0
98632     80.0
80272     75.0
57815     60.0
37820     75.0
19711     92.0
21270     97.0
23647     87.0
Name: assignment1, dtype: float64

In [33]:
# Get all columns
grades.loc[27139, :]

assignment1    100.0
assignment2     95.0
diff            -5.0
Name: 27139, dtype: float64

Now, let's say that student 27139 was incorrectly given a 100, we want to replace their assignment 1 grade with a 98.

In [34]:
# Replace 100 with 98 in a single .loc call (avoids chained assignment)
grades.loc[27139, 'assignment1'] = 98

In [35]:
# Check values for student 27139
grades.loc[27139]

assignment1    98.0
assignment2    95.0
diff           -5.0
Name: 27139, dtype: float64

We can replace values by using `.loc` or `.iloc` to select a specific value and then assigning a new one.
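The sketch below shows the same replacement by position with `.iloc`, on a two-student version of the data. It also recomputes `diff` afterwards, since a derived column is not updated automatically when its inputs change (which is why `diff` still shows -5.0 for this student above).

```python
import pandas as pd

grades = pd.DataFrame({'assignment1': [100.0, 85.0],
                       'assignment2': [95.0, 80.0]},
                      index=[27139, 39978])
grades['diff'] = grades['assignment2'] - grades['assignment1']

# Select the row and the column in one .iloc call, then assign
grades.iloc[0, 0] = 98       # row position 0, column position 0

stale_diff = grades.loc[27139, 'diff']      # still -5.0

# Recompute the derived column after editing its inputs
grades['diff'] = grades['assignment2'] - grades['assignment1']
fresh_diff = grades.loc[27139, 'diff']      # now -3.0
```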

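Another important concept with selecting data is the idea of **masking**. A mask is a Series of `True`/`False` values, one per row, produced by a condition on one or more columns; passing it to the DataFrame keeps only the rows where the mask is `True`.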

`dataframe[mask]` 

In [36]:
# Assignment 1 at or below 50
grades[grades.assignment1 <= 50]

Empty DataFrame
Columns: [assignment1, assignment2, diff]
Index: []


No students failed the assignment! 

In [37]:
# Filtering with AND 
# Assignment 1 between 60 and 70
grades[(grades.assignment1 < 70) & (grades.assignment1 >= 60)]

       assignment1  assignment2  diff
35631         65.0         78.0  13.0
57815         60.0         85.0  25.0


In [38]:
# Filtering with OR 
# Assignment 1 below 70 or Assignment 2 below 70
grades[(grades.assignment1 < 70) | (grades.assignment2 < 70)]

       assignment1  assignment2  diff
35631         65.0         78.0  13.0
57815         60.0         85.0  25.0


In [39]:
# Filtering based on equality
grades[grades.assignment1 == 75]

       assignment1  assignment2  diff
80272         75.0         90.0  15.0
37820         75.0         77.0   2.0


In [40]:
# Using filters with loc
grades.loc[(grades.assignment1 < 70) | (grades.assignment2 < 70)]

       assignment1  assignment2  diff
35631         65.0         78.0  13.0
57815         60.0         85.0  25.0


In [41]:
# Get the diff column with loc 
grades.loc[(grades.assignment1 < 70) | (grades.assignment2 < 70), 'diff']

35631    13.0
57815    25.0
Name: diff, dtype: float64
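It can help to look at the mask itself before using it as a filter. The following sketch uses four of the students; `mask`, `below_70`, `at_least_70` and `n_below` are illustrative names.

```python
import pandas as pd

grades = pd.DataFrame({'assignment1': [98.0, 85.0, 65.0, 60.0],
                       'assignment2': [95.0, 80.0, 78.0, 85.0]},
                      index=[27139, 39978, 35631, 57815])

mask = grades['assignment1'] < 70    # boolean Series: one True/False per row
print(mask)

below_70 = grades[mask]              # rows where the mask is True
at_least_70 = grades[~mask]          # ~ inverts the mask
n_below = mask.sum()                 # True counts as 1, so this counts rows
```

When combining conditions with `&` or `|`, each condition needs its own parentheses, as in the cells above, because these operators bind more tightly than comparisons such as `<`.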

## Quick Intro to Stats

We will go over some very basic statistics for now, and then we can go over more complex statistics depending on the project you choose.

**Mean/Average:** This is the sum of all values divided by the number of values

In [42]:
grades.assignment1.mean()

81.4

In [43]:
# Numpy Functions
np.mean(grades.assignment1)

81.4

**Maximum:** Largest value in your dataset

In [44]:
grades.assignment1.max()

98.0

**Minimum:** Smallest value in your dataset

In [45]:
grades.assignment1.min()

60.0

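**Variance:** Measure that describes the spread between the numbers in your dataset and the mean of the dataset. This is calculated by summing the squared differences between each number and the mean, then dividing by the number of values minus one (the sample variance, which is what pandas computes by default).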

**Standard Deviation:** Measure of how spread out the numbers in the dataset are. This takes the square root of the variance so it is back in terms of your original unit.

In [46]:
grades.assignment1.var()

165.15555555555557

This number is above 100 even though grades lie between 0 and 100. That is because variance is measured in squared units, so we take the standard deviation to get back to the original scale.

In [47]:
grades.assignment1.std()

12.851286144022923
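The variance and standard deviation above can be reproduced by hand. This sketch uses the ten assignment 1 grades (after the correction to 98) and also shows that NumPy's `np.var` divides by n by default, while pandas divides by n - 1. The names `squared_diffs`, `sample_var` and `population_var` are illustrative.

```python
import numpy as np
import pandas as pd

a1 = pd.Series([98.0, 85, 65, 80, 75, 60, 75, 92, 97, 87])

squared_diffs = (a1 - a1.mean()) ** 2
sample_var = squared_diffs.sum() / (len(a1) - 1)    # divide by n - 1
population_var = squared_diffs.sum() / len(a1)      # divide by n

print(sample_var, a1.var())                      # pandas default: ddof=1
print(population_var, np.var(a1.to_numpy()))     # NumPy default: ddof=0
print(np.sqrt(sample_var), a1.std())             # std is the square root
```

Passing `ddof=1` to `np.var` or `np.std` gives the same result as the pandas defaults.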

**Mode:** Most common value in the dataset

In [48]:
grades.assignment1.mode()

0    75.0
dtype: float64

Our mode in this case is 75, since that value appears twice. One way to get the count of each value in our data is to use `df.column_name.value_counts()`.

In [49]:
grades.assignment1.value_counts()

75.0    2
98.0    1
85.0    1
65.0    1
80.0    1
60.0    1
92.0    1
97.0    1
87.0    1
Name: assignment1, dtype: int64

**Median:** Value that indicates the "middle" of a dataset
* This can be found by ordering the dataset from smallest to largest and finding the middle value 
* If the length of a dataset is an even number, we take the average of the two middle numbers

In [50]:
grades.assignment1.median()

82.5

**Percentiles:** This is a number where a certain percentage of your values fall below that number. 
* 25th Percentile: Let's call this x, then 25% of your values would be less than x

We typically look at the 25th, 50th and 75th percentile values. The median is also known as the 50th percentile.

In [51]:
# 25th percentile
np.percentile(grades.assignment1, 25)

75.0

In [52]:
# 50th percentile
np.percentile(grades.assignment1, 50)

82.5

In [53]:
# 75th percentile
np.percentile(grades.assignment1, 75)

90.75
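pandas offers the same calculation through the `quantile` method, which takes fractions between 0 and 1 instead of percentages. A minimal sketch on the ten assignment 1 grades, with `q25`, `q50` and `q75` as illustrative names:

```python
import numpy as np
import pandas as pd

a1 = pd.Series([98.0, 85, 65, 80, 75, 60, 75, 92, 97, 87])

# Passing a list returns one value per requested quantile
q25, q50, q75 = a1.quantile([0.25, 0.50, 0.75])

print(q25, q50, q75)
print(np.percentile(a1, [25, 50, 75]))   # same values from NumPy
```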

**Summary Statistics:** This is a compilation of the statistics to provide some insights into the data. The five number summary consists of:
* Minimum
* 25th Percentile
* 50th Percentile (Median)
* 75th Percentile
* Maximum

In [54]:
# Summary statistics
grades.describe()

       assignment1  assignment2       diff
count    10.000000    10.000000  10.000000
mean     81.400000    85.200000   3.600000
std      12.851286     7.284687  10.864826
min      60.000000    75.000000  -9.000000
25%      75.000000    78.500000  -5.000000
50%      82.500000    86.500000   2.000000
75%      90.750000    90.000000  10.500000
max      98.000000    95.000000  25.000000


You can see that these values match with the percentiles we calculated above. Lastly, we are interested in understanding the range of our data. 

**Range:** The difference between the maximum and minimum of your data

**Interquartile Range:** The difference between your 75th and 25th percentiles.

In [55]:
# Range
grades.assignment1.max() - grades.assignment1.min()

38.0

In [56]:
# IQR
np.percentile(grades.assignment1, 75) - np.percentile(grades.assignment1, 25)

15.75
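Both ranges can also be read off the output of `describe()`, since its result is indexed by the statistic names. A short sketch on the assignment 1 grades; `summary`, `value_range` and `iqr` are illustrative names.

```python
import pandas as pd

a1 = pd.Series([98.0, 85, 65, 80, 75, 60, 75, 92, 97, 87], name='assignment1')

summary = a1.describe()    # a Series indexed by 'count', 'mean', ..., 'max'
value_range = summary['max'] - summary['min']
iqr = summary['75%'] - summary['25%']

print(value_range)   # 38.0
print(iqr)           # 15.75
```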

This should give you an introduction to how to create, manipulate, and transform DataFrames for the first assignment. We will cover some more topics about DataFrames in further tutorials. Next week, we will learn how to visualize your data using `matplotlib`.

# Exercises

## Grades Exercises

### Exercise 1: 
Calculate the average grade of the students based on the two assignments

### Exercise 2: 
Get a subset of the students that are above an 85 average. 

### Exercise 3: 
Get the averages of students that have an average above the median average. 

## Yahoo Finance Exercises

### Exercise 1: 
Get the stock data for 'UBER' from 2020-01-01 to 2021-01-01 and 'LYFT' from 2020-01-01 to 2021-01-01

### Exercise 2:
Extract a subset of the data that shows the closing prices only for both stocks and store that in a new DataFrame. 

### Exercise 3:
Get the summary statistics for both stocks' closing prices. 