# Introducing Libraries: NumPy

![libgif](https://media0.giphy.com/media/7E8lI6TkLrvvAcPXso/giphy.gif?cid=790b76115d360a95792e4333770609b8&rid=giphy.gif)

## Introduction

#### Our goals today are to be able to: <br/>

- Identify and import Python modules and packages (libraries)
- Identify differences between NumPy and base Python in usage and operation
- Create a new module of our own
- Investigate table data in Pandas
- Manipulate Pandas DataFrames

### Activation:

![excel2](img/excelpic2.jpg)

Most people have used Microsoft Excel or Google Sheets. But what are the limitations of Excel?

- [Take a minute to read this article](https://www.bbc.com/news/magazine-22223190)
- Make a list of the problems Excel presents

How is using Python different?

Python
- create documentation of processes as you code
- reduces chances for human error
- do "drag and drop"
- repeatable
- transparent

## 1. Importing Python packages 


In an earlier lesson, we wrote a function to calculate the mean of a list. That was **tedious**. To make our code more efficient, we could store that function in a *Python module* and call it later when we need it. 

And thankfully, other people have _also_ written and optimized functions and wrapped them into **modules** and **packages** (also known as _libraries_).


### Terminology

![mod2](img/modules2.png)

![packages3](img/packages3.png)

<img src="img/python_def.png" width=400>

### pip & the Python Package Index

<img src="img/pypi_packages.png" width=600>

### You're not limited to PyPI

Make your own modules
![pipmod](img/import_modules.png)

![pippack](img/package_redo.png)

## Activity

In your group, look up the most-used Python packages for data science. Pick 3 that you haven't heard of before and look at their documentation. What is each package used for?

### The first library we will import is `NumPy`


![numpy](https://raw.githubusercontent.com/donnemartin/data-science-ipython-notebooks/master/images/numpy.png)

[NumPy](https://www.numpy.org/), short for _numerical python_,  is the fundamental package for scientific computing with Python. 

Some advantages of NumPy are:
- efficient multidimensional array operations
- mathematical functions on arrays without using for loops
- tools for reading and writing data
- linear algebra capabilities
- a C API to allow connectivity with lower level languages

#### Importing and Aliasing Packages

To import a package type `import` followed by the name of the package as shown below.

Many packages have a canonical way to import them with an abbreviated alias.

In [1]:
import numpy as np # np = alias

#### Other standard aliases 

In [2]:
import scipy
import pandas as pd
import matplotlib as mpl
import statsmodels.api as sm

### Import specific modules from a larger package

In [3]:
# sometimes we will want to import a specific module from a package
# the two lines below are equivalent ways to import pyplot under the alias plt
import matplotlib.pyplot as plt
from matplotlib import pyplot as plt 

In [4]:
from sklearn.preprocessing import StandardScaler

In [5]:
ss = StandardScaler()

In [6]:
ss2 = StandardScaler()

#### Helpful links: package documentation

Packages have associated documentation to explain how to use the different tools included in a package.

_Sample of libraries_
- [NumPy](https://docs.scipy.org/doc/numpy/)
- [SciPy](https://docs.scipy.org/doc/scipy/reference/)
- [Pandas](http://pandas.pydata.org/pandas-docs/stable/)
- [Matplotlib](https://matplotlib.org/contents.html)

## 2. NumPy versus base Python

Now that we know packages exist, why do we want to use them? Let us compare base Python and NumPy.

Base Python has lists and can do basic math. NumPy, however, provides a helpful object called an **array**.

NumPy has a few advantages over base Python, which we will look at.

In [7]:
l = [1,2,3]
x=np.array([1,2,3])
print(x)


y=np.array([4,5,6])
print(y)

[1 2 3]
[4 5 6]


#### New type of object

In [8]:
type(x)

numpy.ndarray

In [9]:
x.dtype

dtype('int64')

### Calculating the mean using pure python


```python
samp_list = [1,1,1,1,2,2,2,3,3,10,44]
```

How could you write a for loop to calculate the mean?

In [10]:
samp_list = [1, 1, 1, 1, 2, 2, 2, 3, 3, 10, 44]

In [11]:
# your code here

### Numpy makes math easy

With NumPy we can quickly compute the **mean** and other statistics of lists and arrays.

In [13]:
example = [4,3,25,40,62,20]
print(np.mean(example))

25.666666666666668


#### Array math in action

In [14]:
# Make a list and an array of three numbers

#list

#array


In [16]:
# divide your array by 2


In [18]:
# divide your list by 2

##### Now try multiplying your list and array by 2.  What do you notice?

In [20]:
#multiply list by 2


In [22]:
#multiply array by 2

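NumPy arrays support the division operator `/` (applied to every element), while Python lists do not. Multiplying a list by 2 repeats its contents, whereas multiplying an array by 2 doubles each element. There are other features that make NumPy more useful than base Python for evaluating data.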
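The difference between list and array arithmetic can be sketched as follows. This is a minimal illustration; the names `sample_list` and `sample_array` are made up for this example and are not part of the lesson.

```python
import numpy as np

# Illustrative data (hypothetical names, not from the lesson)
sample_list = [1, 2, 3]
sample_array = np.array(sample_list)

# Multiplying a list repeats its contents; it does not scale the values.
doubled_list = sample_list * 2

# Multiplying or dividing an array applies the operation to every element.
doubled_array = sample_array * 2
halved_array = sample_array / 2

# Dividing a list by a number is not defined and raises a TypeError.
try:
    sample_list / 2
    list_division_error = None
except TypeError as err:
    list_division_error = err

print(doubled_list)
print(doubled_array)
print(halved_array)
print("list / 2 failed with:", list_division_error)
```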

In [24]:
# shape tells us the dimensions of the array (its length along each axis)

numbers_array.shape

(4,)
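To see how `shape` behaves for different arrays, here is a small sketch. The arrays `one_dim` and `two_dim` are illustrative and are not the `numbers_array` used above.

```python
import numpy as np

# Illustrative arrays (hypothetical names, not from the lesson)
one_dim = np.array([10, 20, 30, 40])
two_dim = np.array([[1, 2, 3], [4, 5, 6]])

# A one-dimensional array has a single-entry shape tuple.
print(one_dim.shape)

# A two-dimensional array reports (rows, columns).
print(two_dim.shape)

# ndim is the number of axes; size is the total number of elements.
print(two_dim.ndim, two_dim.size)
```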

### More Array Math 
#### Adding matrices 

In [25]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
print(x)
print(y)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)

[[1. 2.]
 [3. 4.]]
[[5. 6.]
 [7. 8.]]
[[ 6.  8.]
 [10. 12.]]


In [26]:
print(np.add(x, y))

[[ 6.  8.]
 [10. 12.]]


#### Subtracting matrices 

In [27]:
# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print(x - y)

[[-4. -4.]
 [-4. -4.]]


In [28]:
print(np.subtract(x, y))

[[-4. -4.]
 [-4. -4.]]


#### Multiplying matrices 

In [41]:
# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)

[[ 5. 12.]
 [21. 32.]]


In [42]:
print(np.multiply(x, y))

[[ 5. 12.]
 [21. 32.]]
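Note that `*` and `np.multiply` multiply matching elements; they do not perform the matrix product from linear algebra. As a hedged aside, the sketch below contrasts the two using NumPy's `@` operator (equivalent to `np.matmul` for 2-D arrays). The variable names `elementwise` and `matrix_product` are illustrative.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)

# Elementwise product: each entry of x times the matching entry of y.
elementwise = x * y

# Matrix product: rows of x combined with columns of y.
matrix_product = x @ y

print(elementwise)
print(matrix_product)
```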


#### Dividing matrices 

In [31]:
# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)

[[0.2        0.33333333]
 [0.42857143 0.5       ]]


In [32]:
print(np.divide(x, y))

[[0.2        0.33333333]
 [0.42857143 0.5       ]]


#### Raising matrices to powers 

In [33]:
# Elementwise square root; both produce the same array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(x ** (1/2))

[[1.         1.41421356]
 [1.73205081 2.        ]]


In [34]:
print(np.sqrt(x))

[[1.         1.41421356]
 [1.73205081 2.        ]]


### Numpy is faster

Below, you will find a piece of code we will use to compare the speed of operations on a list and operations on an array. In this speed test, we will use the standard-library module [time](https://docs.python.org/3/library/time.html).

In [35]:
import time
import numpy as np

size_of_vec = 100000

def pure_python_version():
    t1 = time.time()
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X))]
    return time.time() - t1

def numpy_version():
    t1 = time.time()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.time() - t1


t1 = pure_python_version()
t2 = numpy_version()
print("python: " + str(t1), "numpy: "+ str(t2))
print("Numpy is in this example " + str(t1/t2) + " times faster!")

python: 0.02288198471069336 numpy: 0.0008256435394287109
Numpy is in this example 27.714120704591394 times faster!


In pairs, run the speed test with a different number, and share your results with the class.
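A single `time.time()` measurement can be noisy. As one possible alternative (not the lesson's method), the standard-library `timeit` module can average several runs. The function names and the `repeats` value below are assumptions chosen for illustration.

```python
import timeit

import numpy as np

size_of_vec = 100_000

def add_with_lists():
    # Build two lists and add them element by element in pure Python.
    x = list(range(size_of_vec))
    y = list(range(size_of_vec))
    return [a + b for a, b in zip(x, y)]

def add_with_numpy():
    # Build two arrays and add them with a single vectorized operation.
    x = np.arange(size_of_vec)
    y = np.arange(size_of_vec)
    return x + y

# Average the run time over several repetitions (hypothetical choice of 20).
repeats = 20
list_seconds = timeit.timeit(add_with_lists, number=repeats) / repeats
numpy_seconds = timeit.timeit(add_with_numpy, number=repeats) / repeats

print("python:", list_seconds, "numpy:", numpy_seconds)
print("ratio (python / numpy):", list_seconds / numpy_seconds)
```

The exact ratio depends on the machine and on `size_of_vec`, so expect different numbers from run to run.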

## 3. Making our own module
<img src="https://media1.giphy.com/media/dW0KhIROCaAdCO0V3S/giphy.gif?cid=790b76115d36096678416c65519d8082&rid=giphy.gif" width=300>

In [36]:
# this option will re-import your module each time you save an update to it

%load_ext autoreload
%autoreload 2

In [37]:
import temperizer as tp
from temperizer import convert_f_to_c

## Example: Convert F to C

1. This function is already implemented in `temperizer.py`.
2. Notice that we can call the imported function and see the result.

In [38]:
# 32F should equal 0C
tp.convert_f_to_c(32)

0.0

In [39]:
# -40F should equal -40C
tp.convert_f_to_c(-40)

-40.0

In [43]:
# 212F should equal 100C
tp.convert_f_to_c(212)

100.0
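The contents of `temperizer.py` are not shown in this lesson. As a rough sketch of what such a module could contain, assuming it uses the standard Fahrenheit-to-Celsius formula, the function might look like this (the real file may differ):

```python
# temperizer.py (hypothetical contents; the actual file is not shown in the lesson)

def convert_f_to_c(temperature_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temperature_f - 32) * 5 / 9


# Quick checks that mirror the calls made in the notebook.
print(convert_f_to_c(32))
print(convert_f_to_c(-40))
print(convert_f_to_c(212))
```

Saving a function like this in a `.py` file next to the notebook is what makes `import temperizer as tp` work.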

## Enter: Pandas

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width=600>  




- The data manipulation capabilities of pandas are built on top of the NumPy library.
- A pandas DataFrame object represents a spreadsheet with cell values, column names, and row index labels.

### 1. Importing and reading data with Pandas!

#### Let's use pandas to read some csv files so we can interact with them.



In [44]:
# First, let's check which directory we are in so the files we expect to see are there.
!pwd
!ls -la

/Users/mmitchell3/Desktop/Flatiron_DS_Lessons/Loading_Data_in_Pandas
total 144
drwxr-xr-x  13 mmitchell3  staff    416 Sep 16 09:10 .
drwxr-xr-x   5 mmitchell3  staff    160 Sep 15 09:09 ..
-rw-r--r--@  1 mmitchell3  staff   6148 Sep 15 15:14 .DS_Store
drwxr-xr-x  16 mmitchell3  staff    512 Sep 15 16:22 .git
-rw-r--r--   1 mmitchell3  staff   1799 Sep 15 16:21 .gitignore
drwxr-xr-x   5 mmitchell3  staff    160 Sep 15 16:22 .ipynb_checkpoints
-rw-r--r--   1 mmitchell3  staff     24 Sep 15 09:09 README.md
drwxr-xr-x   3 mmitchell3  staff     96 Sep 16 09:10 __pycache__
drwxr-xr-x   5 mmitchell3  staff    160 Sep 15 15:32 data
drwxr-xr-x  12 mmitchell3  staff    384 Feb 20  2020 img
-rwxr-xr-x   1 mmitchell3  staff  45724 Sep 16 09:10 libraries_numpy_pandas-SOLUTION.ipynb
-rw-r--r--   1 mmitchell3  staff    208 Sep 16 09:09 temperizer.py
-rwxr-xr-x   1 mmitchell3  staff    

In [45]:
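# `!cd` runs in a temporary subshell, so the directory does not change: `!ls` still lists the current folder (try `!ls data`)
!cd data
!ls 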

README.md                             libraries_numpy_pandas-SOLUTION.ipynb
__pycache__                           temperizer.py
data                                  temperizer_SOLUTION.py
img


In [46]:
import pandas as pd

example_csv = pd.read_csv('data/example1.csv')

There is also `read_excel`, `read_html`, and many other pandas `read_` functions.  
http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
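To experiment with `read_csv` without a file on disk, the sketch below reads CSV text from an in-memory buffer. The CSV text is made up to resemble the example above, and `io.StringIO` simply stands in for a file path.

```python
import io

import pandas as pd

# Made-up CSV text; io.StringIO lets pandas treat a string like a file.
csv_text = """Title1,Title2,Title3
one,two,three
example1,example2,example3
"""

example_df = pd.read_csv(io.StringIO(csv_text))

# The first line of the CSV becomes the column names.
print(example_df.columns.tolist())

# shape reports (number of rows, number of columns).
print(example_df.shape)
```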

In [47]:
example_csv

Unnamed: 0,Title1,Title2,Title3
0,one,two,three
1,example1,example2,example3


#### Your turn! 

Try loading in the example file in the `data` directory called `made_up_jobs.csv` using pandas.

In [48]:
#read in your csv here

### 2. Utilizing and identifying Pandas objects

- What is a DataFrame object and what is a Series object? 
- How are they different from Python lists?

These are the questions we will cover in this section. To begin, let's look at this list of fruits.

In [50]:
fruits = ['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']

print(fruits)

['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']


Using our list of fruits, we can create a pandas object called a `Series`, which is much like an array or a vector.

In [51]:
fruits_series = pd.Series(fruits)

print(fruits_series)
type(fruits_series)

0         Apple
1        Orange
2    Watermelon
3         Lemon
4         Mango
dtype: object


pandas.core.series.Series

One difference between Python **list objects** and pandas **Series objects** is that you can define the index manually for a **Series object**.

In [52]:
ind = ['a', 'b', 'c', 'd', 'e']

fruits_series = pd.Series(fruits, index=ind)

print(fruits_series)

a         Apple
b        Orange
c    Watermelon
d         Lemon
e         Mango
dtype: object
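A custom index lets you look up values by label as well as by position. The following is a minimal sketch; the variable names `by_label`, `by_position`, and `label_slice` are illustrative.

```python
import pandas as pd

fruits = ['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']
fruits_series = pd.Series(fruits, index=['a', 'b', 'c', 'd', 'e'])

# Look up a value by its index label.
by_label = fruits_series['c']

# Look up a value by its integer position.
by_position = fruits_series.iloc[2]

# Slicing by label includes the end label.
label_slice = fruits_series['b':'d']

print(by_label, by_position)
print(label_slice)
```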


#### Your turn!

With a partner do the following:

- create your own custom series from a list
- be sure to include custom indices in your series
- print your series and its data type to ensure you have completed the above

In [53]:
# create a list


# create custom indices for your series



# create the series using your list objects and custom indices



# print your series



We can do a similar thing with Python dictionaries. This time, however, we will create a DataFrame object from a Python dictionary.

In [55]:
# Dictionary with list object in values
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york']
}


students_df = pd.DataFrame(student_dict)

students_df.head()

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


In [56]:
#to find data types of columns
students_df.dtypes

name    object
age     object
city    object
dtype: object

Let's change the data type of ages to int.

In [57]:
students_df.age.values

array(['35', '17', '26'], dtype=object)

In [58]:
# We can also change a column's type, but the change has to make sense.
students_df.age = students_df.age.astype(int)

#Uncomment line below and observe what happens when trying to convert student's name to int or float
#students_df.name = students_df.name.astype(int)

#How about what happens converting numeric to string
#students_df.age = students_df.age.astype(str)

students_df.dtypes

name    object
age      int64
city    object
dtype: object
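The commented-out conversions above can be explored safely with a `try`/`except` block. This is a small sketch under the assumption that a failed conversion raises `ValueError`, which is what pandas does for non-numeric strings such as names. The names `conversion_error` and `age_as_text` are illustrative.

```python
import pandas as pd

students_df = pd.DataFrame({
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
})

# Strings that look like whole numbers convert cleanly to integers.
students_df['age'] = students_df['age'].astype(int)

# Names cannot be interpreted as integers, so the conversion fails.
try:
    students_df['name'].astype(int)
    conversion_error = None
except ValueError as err:
    conversion_error = err

# Converting numbers back to strings always makes sense.
age_as_text = students_df['age'].astype(str)

print(students_df.dtypes)
print("name -> int failed with:", conversion_error)
print(age_as_text.tolist())
```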

We can also use a custom index for these items. For example, we might want them to be the individual student ID numbers.

In [59]:
school_ids = ['1111', '1145', '0096']

# Notice here we use pd.DataFrame not pd.Series as we did for a pandas series.
students_df = pd.DataFrame(student_dict, index=school_ids)

students_df.head()

Unnamed: 0,name,age,city
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


Using Pandas, we can also rename column names.

In [60]:
students_df.columns = ['NAME', 'AGE', 'HOME']
students_df.head()

Unnamed: 0,NAME,AGE,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [61]:
students_df.index

Index(['1111', '1145', '0096'], dtype='object')

Or, we can also change the column names using the rename function.

In [62]:
students_df.rename(columns={"AGE": "YEARS"})

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [63]:
# Notice what happens when we print students_df

students_df

Unnamed: 0,NAME,AGE,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


To keep the renamed columns, reassign the result of `rename` to `students_df`; `rename` returns a new DataFrame and leaves the original unchanged.

In [64]:
students_df = students_df.rename(columns={'AGE': 'YEARS'})
students_df.head()

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


Similarly, there is a tool to remove rows and columns from your DataFrame: the `drop` method.

In [65]:
students_df.drop(columns=['YEARS', 'HOME'])

Unnamed: 0,NAME
1111,Samantha
1145,Alex
0096,Dante


In [66]:
#Notice again what happens if we print students_df 
students_df

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [67]:
# again, we need to assign the returned DataFrame to a variable to keep the result
new_df = students_df.drop(columns=['YEARS', 'HOME'])

Another way to make the change permanent is to pass the option `inplace=True`, which modifies the DataFrame itself instead of returning a new one.

Every function has options. Let's read more about `drop` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)

In [68]:
students_df.drop(columns=['YEARS', 'HOME'], inplace=True)
students_df

Unnamed: 0,NAME
1111,Samantha
1145,Alex
0096,Dante
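The examples above drop columns. Rows can be dropped the same way by passing index labels. A minimal sketch, with a DataFrame rebuilt here so the example stands alone (the name `without_alex` is illustrative):

```python
import pandas as pd

students_df = pd.DataFrame(
    {
        'NAME': ['Samantha', 'Alex', 'Dante'],
        'YEARS': [35, 17, 26],
        'HOME': ['Houston', 'Seattle', 'New york'],
    },
    index=['1111', '1145', '0096'],
)

# Drop a row by its index label; the result is a new DataFrame.
without_alex = students_df.drop(index=['1145'])

print(without_alex)

# The original DataFrame is unchanged because inplace was not used.
print(students_df.shape)
```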


### 3. Filtering Data Using Pandas

Suppose we want to look at only the rows where the student's home is New York, or only the rows where the student is older than 25. How would we do that?


There are several ways to grab particular data from a DataFrame. 
- Python lists allow for selection of data only through integer location. 
- You can use a single integer or slice notation to make the selection but NOT a list of integers.
- Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.
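The limits of list and dictionary selection listed above can be checked directly. The sketch below uses made-up data and illustrative variable names, and ends with a pandas Series for contrast, since it accepts a list of labels.

```python
import pandas as pd

fruits = ['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']
ages = {'Samantha': 35, 'Alex': 17, 'Dante': 26}

# Lists accept a single integer or a slice...
first_fruit = fruits[0]
first_three = fruits[0:3]

# ...but not a list of integers.
try:
    fruits[[0, 2]]
    list_error = None
except TypeError as err:
    list_error = err

# Dictionaries accept a single label...
alex_age = ages['Alex']

# ...but not a list of labels.
try:
    ages[['Samantha', 'Dante']]
    dict_error = None
except TypeError as err:
    dict_error = err

# A pandas Series accepts a list of labels through .loc.
selected_ages = pd.Series(ages).loc[['Samantha', 'Dante']]

print(first_fruit, first_three, alex_age)
print(list_error)
print(dict_error)
print(selected_ages)
```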

### DataFrames can be indexed by column name (label) or row name (index) or by position.   
#### The `.loc` indexer is used for indexing by label.  
#### The `.iloc` indexer is used for indexing by integer position.

In [69]:
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york']
}

students_df = pd.DataFrame(student_dict)

In [70]:
students_df.loc[:, 'name']

0    Samantha
1        Alex
2       Dante
Name: name, dtype: object

### Let's take a look at `.iloc`
#### `.iloc` takes slices based on index position.
#### `.iloc` stands for integer location, which should help you remember what it does
#### `.iloc`[row , column]

In [71]:
# returns the first row
students_df.iloc[0]

name    Samantha
age           35
city     Houston
Name: 0, dtype: object

In [72]:
# returns the first column
students_df.iloc[:, 0]

0    Samantha
1        Alex
2       Dante
Name: name, dtype: object

In [73]:
# returns the first two rows; note that .iloc uses regular Python slicing (the end position is excluded)
students_df.iloc[0:2]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle


In [74]:
# returns the first two columns
students_df.iloc[:, 0:2]

Unnamed: 0,name,age
0,Samantha,35
1,Alex,17
2,Dante,26


In [75]:
# returns the first row and the first two columns (positions 0 and 1)
students_df.iloc[0:1, 0:2]

Unnamed: 0,name,age
0,Samantha,35


### Activity

In your group complete the following:

- Using `.iloc` return the last item in the last row.
- Using `.iloc` return the first item in the last column.
- Using `.iloc` return the first two rows in the first two columns.


In [76]:
# return the last item in the last row using iloc


In [77]:
# return the first item in the last column using iloc


In [78]:
# return the first two rows in the first two columns


### Let's take a look at `.loc`
#### Label based method. 
#### Names or labels of the index are used when taking slices.
#### Also supports boolean subsetting.

We will use loc to return rows and columns based on labels. Let's look at the students_df DataFrame again.


In [82]:
students_df

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


Let's look at the student information associated with the index 0


In [83]:
students_df.loc[0]

name    Samantha
age           35
city     Houston
Name: 0, dtype: object

Now let's look at the student information for rows with the indices 0 to 2 inclusive.

In [84]:
students_df.loc[0:2]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


Note: `.loc` slices include the end label, so the row with index 2 is returned above. `.iloc` follows normal Python slicing and would exclude position 2.
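The difference between the two slicing rules can be checked side by side. A minimal sketch, rebuilding the same DataFrame so it runs on its own (the names `label_rows` and `position_rows` are illustrative):

```python
import pandas as pd

student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york'],
}
students_df = pd.DataFrame(student_dict)

# .loc slices by label and includes the end label (rows 0, 1, and 2).
label_rows = students_df.loc[0:2]

# .iloc slices by position and excludes the end position (rows 0 and 1).
position_rows = students_df.iloc[0:2]

print(len(label_rows), len(position_rows))
```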

We can also index based on the column labels.  Let's select just the column `age`

In [85]:
students_df.loc[:, 'age']

0    35
1    17
2    26
Name: age, dtype: object

Now what if we just wanted the `age` column for the rows with index values of 1 to 2 inclusive?

In [86]:
students_df.loc[1:2, 'age']

1    17
2    26
Name: age, dtype: object

##  Activity

In your group do the following:

- using `.loc` select just the `name` and `city` columns for the rows with indices of 0 and 2 (not including index 1)
- using `.loc` select just the `name` and `age` columns for the rows with indices 1 and 2



In [87]:
# name and city for rows with index 0 and 2


In [88]:
# name and age for rows with index 1 and 2


We can also set a column as our index and then use `.loc` on that new index.

Let's change the index to `name` so that we can filter based on the student's name.

In [91]:
students_df.set_index("name", inplace=True)
students_df

name,age,city
Samantha,35,Houston
Alex,17,Seattle
Dante,26,New york


Great!  Now we can select the rows where the index (now names) is Samantha.

In [92]:
students_df.loc[['Samantha']]

name,age,city
Samantha,35,Houston


In [93]:
# Subsetting nonconsecutive rows
students_df.loc[['Samantha', 'Dante']]

name,age,city
Samantha,35,Houston
Dante,26,New york


In [94]:
# Samantha to the end
students_df.loc['Samantha':]

name,age,city
Samantha,35,Houston
Alex,17,Seattle
Dante,26,New york


### Boolean Subsetting

We can also subset our data using conditional statements.  For example, what if we wanted to filter our dataset so that it only shows students whose name is Samantha when names are in a column of our dataframe?

In [95]:
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante', 'Samantha'],
    'age': ['35', '17', '26', '21'],
    'city': ['Houston', 'Seattle', 'New york', 'Atlanta'],
    'state': ['Texas', 'Washington', 'New York', 'Georgia']
}

students_df = pd.DataFrame(student_dict)

The expression `students_df['name'] == 'Samantha'` produces a pandas Series with a True/False value for every row in the `students_df` DataFrame, with `True` for the rows where the name is "Samantha".

This type of boolean Series can be passed directly to the `.loc` indexer.

In [96]:
students_df.loc[students_df['name'] == 'Samantha']

Unnamed: 0,name,age,city,state
0,Samantha,35,Houston,Texas
3,Samantha,21,Atlanta,Georgia


What about if we only want the city and state of the selected students with the name Samantha?

In [97]:
students_df.loc[students_df['name'] == 'Samantha', ['city', 'state']]

Unnamed: 0,city,state
0,Houston,Texas
3,Atlanta,Georgia


What if we want to select the students who are 21? (The ages are stored as strings, so we compare against `'21'`.)

In [98]:
students_df.loc[students_df['age'] == '21']

Unnamed: 0,name,age,city,state
3,Samantha,21,Atlanta,Georgia
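Conditions can also be numeric or combined. The sketch below is one possible approach, not the only one: it first converts `age` to integers so that `>` compares numbers, then joins two conditions with `&`, wrapping each condition in parentheses. The result names are illustrative.

```python
import pandas as pd

student_dict = {
    'name': ['Samantha', 'Alex', 'Dante', 'Samantha'],
    'age': ['35', '17', '26', '21'],
    'city': ['Houston', 'Seattle', 'New york', 'Atlanta'],
    'state': ['Texas', 'Washington', 'New York', 'Georgia'],
}
students_df = pd.DataFrame(student_dict)

# Convert age from strings to integers so numeric comparisons work.
students_df['age'] = students_df['age'].astype(int)

# Rows where the student is older than 25.
older_than_25 = students_df.loc[students_df['age'] > 25]

# Rows that satisfy two conditions at once; each condition needs parentheses.
texas_samanthas = students_df.loc[
    (students_df['name'] == 'Samantha') & (students_df['state'] == 'Texas')
]

print(older_than_25)
print(texas_samanthas)
```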


## Activity

In your group do the following:

-  Select the rows of the dataframe where students are 21 years old and from Atlanta
-  Select the rows of the dataframe where students are 26 years old and from New York State


In [99]:
# 21 years old and from Atlanta


In [100]:
# What should be returned?
