# Week 02 Coding: NumPy, pandas, and Matplotlib (Lab)

## BEENOS Inc. Machine Learning Study Group

In the second week, you'll practice using `numpy`, `pandas` and `matplotlib`.

In [1]:
# add 2 to every element of a `list`

mylist = [1, 2, 3, 4, 5]

mylist + 2

TypeError: can only concatenate list (not "int") to list

But such an operation is trivial for `numpy`:

In [2]:
# add 2 to every element of an `ndarray`

import numpy as np

myarray = np.array([1,2,3,4,5])

myarray + 2

array([3, 4, 5, 6, 7])
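For comparison, here is a minimal sketch of what the same task takes with a plain list. The variable names `plus_two_list` and `plus_two_array` are made up for this illustration.

```python
import numpy as np

mylist = [1, 2, 3, 4, 5]

# a plain list needs an explicit loop (here, a list comprehension)
plus_two_list = [x + 2 for x in mylist]

# an ndarray applies the addition to every element at once
plus_two_array = np.array(mylist) + 2

print(plus_two_list)   # [3, 4, 5, 6, 7]
print(plus_two_array)  # [3 4 5 6 7]
```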

We can also use `numpy` to create matrices from lists of lists:

In [3]:
# create a 3x3 matrix from a list of lists

morelists = [[1,2,3], [4,5,6], [7,8,9]]

mymatrix = np.array(morelists)

mymatrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
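Once the matrix exists, its shape and individual elements are easy to inspect. The short sketch below rebuilds the same 3x3 matrix so it can run on its own; the indexing choices are only examples.

```python
import numpy as np

mymatrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(mymatrix.shape)   # (3, 3) : three rows, three columns
print(mymatrix.ndim)    # 2 : number of dimensions
print(mymatrix[1, 2])   # 6 : row index 1, column index 2
print(mymatrix[:, 0])   # [1 4 7] : every row, first column
```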

### `numpy` Practice

For the first part of this lab, you'll be completing short exercises to get you used to working with `numpy` arrays.

The exercises are taken from [Machine Learning Plus](https://www.machinelearningplus.com/python/101-numpy-exercises-python/) and increase in difficulty from Level 1 to Level 4.

Feel free to skip around to exercises that challenge you.

Good luck!

**Q.0** Import numpy as `np` and print the version.

*Sample Output*:

> `1.15.4`

In [None]:
# INSERT CODE HERE

**Q.1** Use `np.array` to create a 1D array of numbers from 0 to 9.

*Output:*

> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
# INSERT CODE HERE

**Q.2** Extract all odd numbers from `arr`.

*Output:*

> array([1, 3, 5, 7, 9])

In [None]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# INSERT CODE HERE

**Q.3** Use `reshape` to convert a 1D array to a 2D array with two rows.

*Output:*

>     array([[0, 1, 2, 3, 4],
>            [5, 6, 7, 8, 9]])

In [None]:
arr = np.arange(10)

# INSERT CODE HERE

**Q.4** Use `np.full` to create an nD boolean array. Choose your size and `True` or `False`.

*Sample Output:*

>     array([[ True,  True],
>            [ True,  True],
>            [ True,  True]])

In [None]:
# INSERT CODE HERE

**Q.5** Use `np.where` to replace all odd numbers in `arr` with `-1`, but **DO NOT** change `arr`.

*Sample Output*:

> array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1]) `# newarray`

> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) `# arr`

In [None]:
arr = np.arange(10)

# INSERT CODE HERE

print(arr, newarray)

**Q.6** Create the following pattern. **Use numpy functions.**

*Output:*

> array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])

In [None]:
a = np.array([1,2,3])

# INSERT CODE HERE

**Q.7** Get all common items in both `a` and `b`.

*Output:*

> array([2, 4])

In [None]:
a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])

# INSERT CODE HERE

**Q.8** Remove from `a` all items in `b`.

*Output:*

> array([0, 1, 2, 4])

In [None]:
a = np.array([0,1,2,3,4,5])
b = np.array([3,5,6,7,8,9])

# INSERT CODE HERE

**Q.9** Create a 2D array of size 5x3 containing random numbers between 5 and 10.

In [None]:
# INSERT CODE HERE

**Q.10** Use `np.genfromtxt` to import the Iris flowers dataset from the given url.

*Hint: If your last column shows `nan` values, try specifying an `object` data type.*

*Output:*

> `# Print the first row`

> array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa']], dtype=object)

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# INSERT CODE HERE

## Part 1: Pandas

NumPy is short for Numerical Python, and it is one of the most important packages for working with data in Python.

Other popular packages, like `pandas`, depend on the versatile functionality provided by numpy.

The core benefit to using `numpy` is access to its `ndarray` objects.

`ndarray` is short for **n-dimensional arrays**, which allow us to perform vectorized mathematical operations and data manipulation on collections of data.
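As a rough illustration of vectorized math and data manipulation on an `ndarray`, consider the sketch below. The `prices` values are invented for the example and are not taken from any dataset in this lab.

```python
import numpy as np

prices = np.array([10.0, 12.5, 8.0, 20.0])  # hypothetical data

# vectorized math: the subtraction applies to every element, no loop needed
discounted = prices - 2

# data manipulation: aggregate and sort the whole collection in one call
total = prices.sum()
ordered = np.sort(prices)

print(discounted)  # [ 8.  10.5  6.  18. ]
print(total)       # 50.5
print(ordered)     # [ 8.  10.  12.5 20. ]
```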

### `pandas` Practice

For the next part of this lab, you'll be completing short exercises to get you used to working with `pandas` Series and DataFrames.

The exercises are taken from [Machine Learning Plus](https://www.machinelearningplus.com/python/101-pandas-exercises-python/) and increase in difficulty from Level 1 to Level 3.

Feel free to skip around to exercises that challenge you.

Good luck!

**Q.0** Pandas relies on `numpy`. You should have already imported `numpy`. If not, import it and also import pandas (by convention as `pd`) and check your `pandas` version.

*Sample Output*:

> `0.23.4`

In [None]:
# INSERT CODE HERE

In [5]:
import pandas as pd
import numpy as np

pd.__version__

'0.23.4'

**Q.1** Like `numpy`, `pandas` has popular structures that it is well known for. The first one is called a **Series**, which is a one-dimensional labeled array. In layman's terms, a Series is simply a *column* of data, like one might find in an Excel spreadsheet.

A Series can hold any data type and can be coerced from multiple data structures. To test this, convert each of the following data structures (a list, an array and a dictionary) into a separate **Series**.

*Sample Output*:

> `# Print the head of the mydict series`

>     a     1
>     b     2
>     c     3
>     d     4
>     e     5
>     dtype: int64

In [29]:
mylist = list('abcdefghijklmnopqrstuvwxyz')
myarray = np.arange(1, 27)
mydict = dict(zip(mylist,myarray))

mylist_to_series = pd.Series(mylist) # INSERT CODE HERE
myarray_to_series = pd.Series(myarray) # INSERT CODE HERE
mydict_to_series = pd.Series(mydict) # INSERT CODE HERE

mydict_to_series.head()

a    1
b    2
c    3
d    4
e    5
dtype: int64
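A Series pairs each value with a label, so you can look values up by label or by position. The sketch below uses a small made-up Series named `scores` to show both.

```python
import pandas as pd

scores = pd.Series({'a': 1, 'b': 2, 'c': 3})  # hypothetical data

print(scores.index.tolist())   # ['a', 'b', 'c'] : the labels
print(scores.values.tolist())  # [1, 2, 3] : the underlying values
print(scores['b'])             # 2 : lookup by label
print(scores.iloc[0])          # 1 : lookup by position
```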

**Q.2** Series can also be coerced into the second fundamental `pandas` data structure, the **DataFrame**. It is a two-dimensional array made up of Series. Again, in layman's terms, a DataFrame is like one spreadsheet in Excel: it contains multiple columns.

Each column of a DataFrame can hold any data type, and can be created from Series or other data structures. Let's try this by turning your `mydict` series into a `pandas` DataFrame. Use `.reset_index()` to add *both* columns to the table.

*Sample Output*:

> `# Print the head of the dataframe`

>           index  0
>     0     a      1
>     1     b      2
>     2     c      3
>     3     d      4
>     4     e      5

In [30]:
myseries = pd.Series(mydict)

mydataframe = myseries.to_frame().reset_index() # INSERT CODE HERE

mydataframe.head()

  index  0
0     a  1
1     b  2
2     c  3
3     d  4
4     e  5
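To see that each column of a DataFrame can carry its own data type, here is a small sketch that builds a DataFrame from three Series. The column names and values are invented for the example.

```python
import pandas as pd

# three hypothetical Series with different kinds of data
letters = pd.Series(['a', 'b', 'c'])
numbers = pd.Series([1, 2, 3])
weights = pd.Series([0.5, 1.5, 2.5])

# each Series becomes one column of the DataFrame
df = pd.DataFrame({'letter': letters, 'number': numbers, 'weight': weights})

print(df.shape)   # (3, 3)
print(df.dtypes)  # one dtype per column
```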


**Q.3** Our new dataframe looks a bit clunky. One problem is the column names `"index"` and `"0"` don't tell us much about the data in the table.

It might be easier to read if we change those names to `"letter"` and `"position"`, or even `"number"`. You can use `.rename()`.

*Hint: Does your head show `'index'` and `'0'` still? Try using `inplace`!*

*Sample Output*:

> `# Print the head of the dataframe`

>           letter  number
>     0     a       1
>     1     b       2
>     2     c       3
>     3     d       4
>     4     e       5

In [36]:
mydataframe.rename(columns={"index":"letter", 0:"number"}, inplace=True) #INSERT CODE HERE

mydataframe.head()

  letter  number
0      a       1
1      b       2
2      c       3
3      d       4
4      e       5


**Q.4** Alright, let's practice with a bit of real data. We'll be looking at the [Cars93](https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv) dataset for a few problems.

The dataset has already been loaded for you. Figure out how many rows and columns it contains.

*Sample Output*:

> (93, 27)

In [79]:
cars_df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv")

cars_df.shape # INSERT CODE HERE

(93, 27)

**Q.5** Let's get a bit more info on our dataset. First, we want to figure out the *type* of the data that exists within each column. (What are we working with? Strings? Integers? Floats? Datetime objects?)

This is extremely important to know for manipulating our dataset later, because **all the data in a pandas *Series* or a DataFrame *column* shares a single data type (`dtype`)**.

What types are in the Cars93 dataset?

*Sample Output (first 3 data types)*:

>     Manufacturer    object
>     Model           object
>     Type            object
>     dtype: object

In [80]:
cars_df.dtypes[:3] # INSERT CODE HERE

Manufacturer    object
Model           object
Type            object
dtype: object
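One consequence of the one-type-per-column rule is that `pandas` picks a common type when you mix values. The sketch below uses made-up Series to show how integers are upcast to floats, and how unrelated types fall back to `object`.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
mixed_numbers = pd.Series([1, 2.5, 3])    # ints and floats together
mixed_types = pd.Series([1, 'two', 3.0])  # numbers and a string together

print(ints.dtype)           # an integer dtype (typically int64)
print(mixed_numbers.dtype)  # float64 : the ints were upcast
print(mixed_types.dtype)    # object : no common numeric type
```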

**Q.6** As you work with real data, you'll soon find out that the more you *know* about the dataset you're working with, the easier it is to build models and make predictions.

One of the things you'll want to know is **summary statistics**: things like mean, standard deviation, maximum and minimum values, etc. Knowing these stats will help you determine whether your data needs to undergo transformation (like scaling and normalization) before modeling.

Show the summary stats for this dataset now. What is each row telling you?

*Sample Output*:

> `# First few columns of the first two rows`

>               Min.Price       Price           Max.Price
>     count     86.000000       91.000000       88.000000       ...
>     mean      17.118605       19.616484       21.459091       ...
>     ...

In [81]:
cars_df.describe() # INSERT CODE HERE

,Min.Price,Price,Max.Price,MPG.city,MPG.highway,EngineSize,Horsepower,RPM,Rev.per.mile,Fuel.tank.capacity,Passengers,Length,Wheelbase,Width,Turn.circle,Rear.seat.room,Luggage.room,Weight
count,86.0,91.0,88.0,84.0,91.0,91.0,86.0,90.0,87.0,85.0,91.0,89.0,92.0,87.0,88.0,89.0,74.0,86.0
mean,17.118605,19.616484,21.459091,22.404762,29.065934,2.658242,144.0,5276.666667,2355.0,16.683529,5.076923,182.865169,103.956522,69.448276,38.954545,27.853933,13.986486,3104.593023
std,8.82829,9.72428,10.696563,5.84152,5.370293,1.045845,53.455204,605.554811,486.916616,3.375748,1.045953,14.792651,6.856317,3.778023,3.304157,3.018129,3.120824,600.129993
min,6.7,7.4,7.9,15.0,20.0,1.0,55.0,3800.0,1320.0,9.2,2.0,141.0,90.0,60.0,32.0,19.0,6.0,1695.0
25%,10.825,12.35,14.575,18.0,26.0,1.8,100.75,4800.0,2017.5,14.5,4.0,174.0,98.0,67.0,36.0,26.0,12.0,2647.5
50%,14.6,17.7,19.15,21.0,28.0,2.3,140.0,5200.0,2360.0,16.5,5.0,181.0,103.0,69.0,39.0,27.5,14.0,3085.0
75%,20.25,23.5,24.825,25.0,31.0,3.25,170.0,5787.5,2565.0,19.0,6.0,192.0,110.0,72.0,42.0,30.0,16.0,3567.5
max,45.4,61.9,80.0,46.0,50.0,5.7,300.0,6500.0,3755.0,27.0,8.0,219.0,119.0,78.0,45.0,36.0,22.0,4105.0
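Notice that the `count` row differs between columns (86, 91, 88, ...). That is because `.describe()` skips missing values. The toy DataFrame below, with one invented `price` column, is a small sketch of that behavior.

```python
import numpy as np
import pandas as pd

# hypothetical column with one missing value
toy = pd.DataFrame({'price': [10.0, 20.0, np.nan, 30.0]})

stats = toy.describe()

print(stats.loc['count', 'price'])  # 3.0 : the NaN is not counted
print(stats.loc['mean', 'price'])   # 20.0 : mean of 10, 20 and 30
print(stats.loc['50%', 'price'])    # 20.0 : the median
```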


**Q.7** Let's use our summary statistics to learn more about our data.

Find the `Manufacturer`, `Model` and `Type` of the car that has the highest `Price`. Use `.loc[]` to pull out this information.

*Sample Output*:

>               Manufacturer    Model     Type       Price
>     58        Mercedes-Benz   300E      Midsize    61.9

In [82]:
cars_df.loc[cars_df['Price'] == cars_df['Price'].max(), ['Manufacturer', 'Model', 'Type', 'Price']] # INSERT CODE HERE

     Manufacturer Model     Type  Price
58  Mercedes-Benz  300E  Midsize   61.9
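Another way to reach the same kind of result is to ask for the *label* of the largest value with `.idxmax()` and pass that label to `.loc[]`. This is only a sketch on an invented three-row table, not the Cars93 data.

```python
import pandas as pd

# hypothetical table with a non-default index
toy = pd.DataFrame(
    {'name': ['x', 'y', 'z'], 'price': [5.0, 9.0, 7.0]},
    index=[10, 11, 12],
)

top_label = toy['price'].idxmax()            # index label of the highest price
row = toy.loc[top_label, ['name', 'price']]  # select that row and two columns

print(top_label)     # 11
print(row['name'])   # y
print(row['price'])  # 9.0
```

Note that `.idxmax()` returns only the first label if several rows tie for the maximum, while the boolean comparison used above keeps every tied row.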


**Q.8** We want to see if our dataset has any missing values. (If it does, we may need to transform them to make the dataset compatible with machine learning algorithms.)

Find all the missing values in the dataset now.

*Sample Output*:

> `# Missing values per column (first 3 columns)`

>     Manufacturer    4
>     Model           1
>     Type            3

In [83]:
cars_df.isna().sum() # INSERT CODE HERE

Manufacturer           4
Model                  1
Type                   3
Min.Price              7
Price                  2
Max.Price              5
MPG.city               9
MPG.highway            2
AirBags                6
DriveTrain             7
Cylinders              5
EngineSize             2
Horsepower             7
RPM                    3
Rev.per.mile           6
Man.trans.avail        5
Fuel.tank.capacity     8
Passengers             2
Length                 4
Wheelbase              1
Width                  6
Turn.circle            5
Rear.seat.room         4
Luggage.room          19
Weight                 7
Origin                 5
Make                   3
dtype: int64
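To make the mechanics concrete: `.isna()` returns a table of `True`/`False` flags, and `.sum()` counts the `True` values per column. The sketch below uses a tiny invented DataFrame.

```python
import numpy as np
import pandas as pd

# hypothetical table with missing values in both columns
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', None, None]})

flags = toy.isna()              # True wherever a value is missing
per_column = flags.sum()        # True counts as 1, so this counts missing values
total = int(per_column.sum())   # missing values in the whole table

print(per_column['a'])  # 1
print(per_column['b'])  # 2
print(total)            # 3
```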

**Q.9** Normally, we'd need to take a closer look at each column containing `null` values before deciding what to do with them. But for the purpose of this notebook, let's just assume we want to get rid of them!

If you ran the above cell correctly, it should show that the `'Model'` column only has one missing value. We shouldn't miss this one too much. **Drop this row now.** Then, check to see how many missing values are in that column.

*Hint: If your code still shows 1 missing value in `'Model'`, try using `inplace`!*

*Output:*

> `# Shape of the dataframe with the missing value 'Model' row gone`

> (92, 27)

In [88]:
# drop the row with a missing value in `Model`
cars_df = cars_df.dropna(subset=['Model']) # INSERT CODE HERE

# check the shape of the dataframe; there should be 92 rows
cars_df.shape

(92, 27)

In [89]:
# check that the missing value from `Model` was removed
cars_df['Model'].isna().sum() # INSERT CODE HERE

0
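The hint about `inplace` exists because most `pandas` methods return a *new* object and leave the original untouched. Here is a small sketch with an invented table showing that `.dropna()` does not change `toy` unless you keep the result.

```python
import numpy as np
import pandas as pd

# hypothetical table: one missing 'model', one missing 'price'
toy = pd.DataFrame({'model': ['A', None, 'C'], 'price': [1.0, 2.0, np.nan]})

# only rows with a missing 'model' are dropped; the result is a new DataFrame
cleaned = toy.dropna(subset=['model'])

print(toy.shape)      # (3, 2) : the original is unchanged
print(cleaned.shape)  # (2, 2) : one row removed
```

Reassigning the result (`toy = toy.dropna(...)`) or passing `inplace=True` are two ways to make the change stick.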

**Q.10** We can drop entire columns as well. Again, in real life we'd want to investigate before doing this. But this is just for practice! So let's try it out.

If you ran the Q.8 code cell correctly, you should see that the column `'Luggage.room'` has 19 missing values. That's way more than any other column of the dataset!

Get rid of this column now.

*Output:*

> `# Shape of the dataframe with "Luggage.room" gone`

> (92, 26)

In [93]:
cars_df = cars_df.drop(columns=['Luggage.room']) # INSERT CODE HERE

cars_df.shape

(92, 26)


- create/read df
- describe
- summary
- indexing (loc, iloc)
- sorting
- groupby/agg
- transposing

From MLM:
```python
import numpy
import pandas
myarray = numpy.array([[1, 2, 3], [4, 5, 6]])
rownames = ['a', 'b']
colnames = ['one', 'two', 'three']
mydataframe = pandas.DataFrame(myarray, index=rownames, columns=colnames)
print(mydataframe)
```

EXERCISES: https://www.machinelearningplus.com/python/101-pandas-exercises-python/

## Part 4: Matplotlib

[intro to what matplotlib is, and why we use it]

[what they'll be doing in this section]

- draw simple plot
- create multiple subplots
- piechart
- histogram
- bar plot
- saving a plot
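As a rough preview of a few items on this list (a simple plot, multiple subplots, a histogram, and saving a plot), here is a minimal sketch. The data, titles, and output file name are all made up for illustration. The `matplotlib.use("Agg")` line only makes the sketch runnable without a display; you would normally leave it out in a notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

# one figure with two side-by-side subplots
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

ax_line.plot(x, np.sin(x))        # simple line plot
ax_line.set_title("sin(x)")

ax_hist.hist(np.sin(x), bins=10)  # histogram of the same values
ax_hist.set_title("histogram of sin(x)")

# save the figure to a temporary folder (hypothetical file name)
out_path = os.path.join(tempfile.gettempdir(), "week02_example.png")
fig.savefig(out_path)
plt.close(fig)
```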

https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html

https://heartbeat.fritz.ai/introduction-to-matplotlib-data-visualization-in-python-d9143287ae39

https://www.tutorialdocs.com/article/python-matplotlib-tutorial.html

EXERCISES: https://www.w3resource.com/graphics/matplotlib/