# Intro to Python for Data Science



---
<img src="https://calnerds.berkeley.edu/css/images/logo.jpg"  /> <!--style="width: 500px; height: 275px;"-->



## Day 2


### Table of Contents


1 - [Functions](#section1)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 - [Built-in Functions](#subsection1.1)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 - [Defining Functions](#subsection1.2)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.3 - [Importing Libraries](#subsection1.3)<br>
2 - [Intro to Numpy](#section2)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 - [Creating Fast Arrays in Numpy](#subsection2.1)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 - [Indexing Inside Arrays](#subsection2.2)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3 - [Array Operations](#subsection2.3)<br>
3 - [Intro to Pandas Data Frames](#section3)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.1 - [Importing Data & Summary Statistics](#subsection3.1)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.2 - [Indexing & Slicing](#subsection3.2)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.3 - [Manipulating Columns](#subsection3.3)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.4 - [Boolean Indexing](#subsection3.4)<br>
4 - [Plotting](#section4)<br>



---
# 1. Functions<a id='section1'></a>

### 1.1 Built-in Functions<a id='subsection1.1'></a>

Python has some built-in (aka available by default) functions and methods. Let's go over a few of these.


If something is a built-in function, it will appear in green in your coding environment. 

In [None]:
# EXAMPLE 

max(2, 5, 9, 7)

In [None]:
# EXAMPLE 

min(3, 7)

How will this function work with textual information (aka strings)?
Pause for a moment to answer this question before you run this cell.

In [None]:
# EXAMPLE 

max("dog", "Dog")

Another built-in function you will most likely use a lot in your coding is **len()**. It tells you the length of different objects.

In [None]:
# EXAMPLE

len("dog")

In [None]:
# EXERCISE

# what do you think this cell will output?


len("dog ")

When used with lists, it counts only the "outer" objects. That is, if there are lists or arrays inside of your list, `len()` counts each inner list (array, tuple, dictionary) as a single object, no matter how many objects it contains.

In [None]:
# EXAMPLE


len([1,3,4])

In [None]:
random = [[1,2,3,4],[4,5,6]]

# Before running this cell, can you guess the output 
# it will give you?

len(random)
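If you want the total number of inner elements instead of the number of outer objects, one option is to add up the lengths of the inner lists. The sketch below is only an illustration; `nested`, `outer_count`, and `inner_count` are hypothetical names that do not appear elsewhere in this notebook.

```python
# `nested` is a hypothetical list used only for this illustration.
nested = [[1, 2, 3, 4], [4, 5, 6]]

# len() on the outer list counts the inner lists themselves.
outer_count = len(nested)

# Adding up len() of each inner list counts the numbers inside them.
inner_count = sum(len(inner) for inner in nested)

print(outer_count, inner_count)
```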

In the example below, we will use a built-in method. For our purposes, the main difference between a function and a method is the syntax.

Functions (not only the built-in ones) are usually used in the form of   

**function(argument 1, argument n)**

Methods, by contrast, are usually attached to the end of a variable with a dot.

**variable.method()**

In [None]:
# EXAMPLE 

hello = "Hello World"

print(hello.upper())
print(hello)

Have you noticed that the variable **hello** didn't change? Can you think of why that is?

In [None]:
# EXAMPLE 


hello = hello.lower()
hello

In the cell below, create a variable **phrase** which tells us what school you go to. Then change all the letters to upper-cased letters.

In [None]:
# EXERCISE 


phrase = "I go to ..."
...

print(phrase)

### 1.2 Defining Functions <a id='subsection1.2'></a>

A function is a block of code that you can reuse to perform a specific action. In the next cell, we will define a simple one-line function that takes one argument.

In [None]:
# EXAMPLE

def square(x):
    return x**2


a = square(3)
a

Your functions can perform multiple calculations and use more than one argument. You can also incorporate for-loops, conditionals, and other functions inside of your function.


**Note:** we won't be going over nested functions (functions with other functions inside), but feel free to search for them.

In the cell below, we are going to define a function that will multiply the two numbers you give it only if they are not equal to each other.

In [None]:
# EXAMPLE

def mult_not_eq(x, y):
    # show the two numbers we received
    print(x, y)
    # if the first number is not equal to the second
    if x != y:
        # I will multiply them
        return x * y
    else:
        print("Use the square function instead")
        
        
mult_not_eq(2, 2)
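Functions can also combine a for-loop with a conditional, as mentioned earlier in this section. Here is a minimal sketch; the function name `sum_of_evens` and the sample list are made up for illustration.

```python
def sum_of_evens(numbers):
    """Add up only the even numbers in a list."""
    total = 0
    for n in numbers:
        # keep the number only if it divides evenly by 2
        if n % 2 == 0:
            total = total + n
    return total


result = sum_of_evens([1, 2, 3, 4, 5, 6])
print(result)  # 2 + 4 + 6 = 12
```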

Now it's time for you to define a function. Create a function that will add up two numbers, but only if they are not equal.

In [None]:
# EXERCISE

def sum_not_equals(..., ...):
    if ...:
        return ...
    else:
        ...
    
    
sum_not_equals(1, 5)

---
### 1.3 Importing Libraries <a id='subsection1.3'></a>



Python is a relatively compact programming language by itself. But while you can build all your functions from scratch, that is unnecessary in most cases. Python programs have access to a lot of already built **"libraries"** - collections of pre-coded functions. To use them, you normally need to install them on your computer (not needed for most common libraries if you use Anaconda) and then import them into your Jupyter Notebook. 

In [None]:
# EXAMPLE

import numpy as np

Now that we have imported the **numpy** library, we can use its functions.

In [None]:
# EXAMPLE

rand_nums = [1,5,8,9,15]

np.mean(rand_nums)

Another useful library to have is **math**. It has things like cos, sin, e, and pi pre-built so that you can use them in your calculations.

In [None]:
# EXAMPLE

import math

radius = 3
area_circle = math.pi * radius**2

area_circle

# 2. Numpy (Optional)<a id='section2'></a>
Numpy is an essential library both for Data Science and for many other applications. Its primary purpose is to give you access to fast arrays/matrices of data for numerical computations. But it also contains a number of very useful functions you may want to use even if you don't need fast arrays. In other words, if there is one library you are most likely to need when writing your code, it's probably going to be numpy.

In [None]:
# It is customary to import numpy under the name "np"
import numpy as np

### 2.1 Creating Fast Arrays in Numpy <a id='subsection2.1'></a>

The main reason to use numpy is its blazingly fast array data structure, the numpy array. Once you import numpy into your code, you will be able to create arrays from other data structures like lists, or generate completely new arrays according to your requirements. The easiest way to create a numpy array is to give numpy a list of the values you want it to put in that array.

In [None]:
list_of_integers = [1,2,3,5,6,9,117,377,1456783]

In [None]:
new_array = np.array(list_of_integers)
new_array

Alternatively, you can ask numpy to create a blank array or an array with dummy values (zeroes or ones) and then fill it up with data as needed.

In [None]:
array_of_ones = np.ones(9)
array_of_ones

In [None]:
array_of_zeros = np.zeros(15)
array_of_zeros

In [None]:
# If you want an array of another  number you can create an array of ones 
# and then multiply that entire array with the number you want
array_of_threes = np.ones(10) * 3
array_of_threes

Operations with numpy arrays can be anywhere from dozens to thousands of times faster than with equivalent lists. One of the reasons numpy is so fast is that it allows only one type of data per array. Numpy automatically determines what type of data to store in your arrays: when creating an array, it converts all your data to the most general type needed to represent every value.

In [None]:
array_of_integers = np.array([145353,23,33789,6653,6,9878,117,377,1456783])
array_of_integers

In [None]:
array_of_strings = np.array(['this', 'is', 'a', 'collection', 'of', 'strings'])
array_of_strings

In [None]:
mixed_array = np.array([1,5,7,3,13422, 'this is a string'])
mixed_array

In [None]:
sneaky_mixed_array = np.array([1,4,6,8,9,'3'])
sneaky_mixed_array
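To see which type numpy picked, you can inspect an array's `dtype` attribute. The sketch below uses small hypothetical arrays (`ints`, `ints_and_float`, `ints_and_string`) that mirror the cells above. The exact integer width you see may depend on your platform.

```python
import numpy as np

# Hypothetical arrays mirroring the cells above.
ints = np.array([1, 2, 3])
ints_and_float = np.array([1, 2, 3.5])
ints_and_string = np.array([1, 4, 6, 8, 9, '3'])

print(ints.dtype)             # an integer type
print(ints_and_float.dtype)   # float64: the integers became floats
print(ints_and_string.dtype)  # a Unicode string type
print(ints_and_string[0])     # the number 1 is now stored as the string '1'
```

A single string is enough to turn every value in the array into a string, which is why the "sneaky" array above no longer holds numbers.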

Just like with regular Python lists you can easily create nested arrays in order to represent 2D, 3D, and multi-dimensional data.

In [None]:
two_D_list = [[1, 2, 3,], [4, 5, 6], [7, 8, 9]]
two_D_list

In [None]:
two_D_array = np.array(two_D_list)
two_D_array
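If you are unsure how many dimensions an array has, its `ndim`, `shape`, and `size` attributes can tell you. This is a small sketch with a hypothetical array named `grid` that has the same values as the 2D array above.

```python
import numpy as np

# Hypothetical 3 x 3 array with the same values as the 2D list above.
grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(grid.ndim)   # 2 dimensions
print(grid.shape)  # (3, 3): 3 rows and 3 columns
print(grid.size)   # 9 values in total
```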

### 2.2 Indexing Inside Arrays  <a id='subsection2.2'></a>

One of the most important uses of numpy arrays is to store and retrieve data quickly. If you want a particular value from an array, you can access it exactly the same way you would from a list.

In [None]:
simple_list = [1,2,3,4,5,6]
simple_list[2]

In [None]:
simple_array = np.array(simple_list)
simple_array[2]

In [None]:
two_D_array

In [None]:
two_D_array[0, 0]

In [None]:
two_D_array[0, 2]

In [None]:
simple_array[1:3]

In [None]:
two_D_array[1:, 1:]
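In a 2D array the first index selects rows and the second selects columns, and either one can be a slice. The sketch below spells out a few cases on a hypothetical array called `grid`; the variable names are made up for illustration.

```python
import numpy as np

# Hypothetical 3 x 3 array used only for this illustration.
grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

first_row = grid[0]          # [1 2 3]
middle_column = grid[:, 1]   # [2 5 8]: every row, column 1
lower_right = grid[1:, 1:]   # [[5 6] [8 9]]: rows 1 onward, columns 1 onward

print(first_row)
print(middle_column)
print(lower_right)
```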

### 2.3 Array Operations <a id='subsection2.3'></a>

In [None]:
my_array = np.array([0, 1, 2, 3, 4, 5, 6])
my_array

#### Add  3 to array

* adds 3 to each element in the array 

In [None]:
my_array + 3

#### Multiply by 3

* multiplies each element in the array by 3

In [None]:
my_array * 3


#### Operations can apply to more than one array. 


In [None]:
array_one = np.array([1, 2, 3])
array_two =  np.array([4, 5, 6])

In [None]:
array_three  = array_one + array_two 
array_three

In [None]:
array_one * array_two

In [None]:
array_one.dot(array_two)

In [None]:
np.dot(array_one, array_two)
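For two 1D arrays, the dot product is the sum of the element-by-element products. The following sketch checks that by hand; `a`, `b`, and the other names are hypothetical stand-ins for the arrays above.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

elementwise = a * b             # [ 4 10 18]
manual_dot = elementwise.sum()  # 4 + 10 + 18 = 32
built_in_dot = np.dot(a, b)     # 32

print(manual_dot, built_in_dot)
```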

---
# 3. Intro to Pandas Data Frames<a id='section3'></a>

In [None]:
import pandas as pd

### 3.1 Importing Data & Summary Statistics  <a id='subsection3.1'></a>

We will use the function `read_csv()` in the _pandas_ library to import and read our data. The _csv_ at the end of the function tells the program to read a comma-delimited file. However, there are many other types of delimiters such as tab, semicolon, pipe, etc. 
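For files that use a different delimiter, `read_csv()` accepts a `sep` argument. Here is a minimal sketch that reads semicolon-delimited text; the data and the names `raw_text` and `scores` are made up, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited text standing in for a file on disk.
raw_text = "name;score\nAda;90\nGrace;95\n"

# sep tells pandas which character separates the columns.
scores = pd.read_csv(io.StringIO(raw_text), sep=";")
print(scores)
```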

We will now read the _iris.csv_ file as a **DataFrame** and store it in a variable called _iris_.

In [None]:
iris = pd.read_csv('data/iris.csv')

Great! Now let's explore our data set. 

We will begin by using the method `.head()`. By default, it shows the first 5 rows of our data set, but you can tell it to display the first _n_ rows by passing _n_ as an argument to `.head()`.

In [None]:
iris.head()

You can also see the last _n_ rows of our data using the method `.tail()`.

In [None]:
iris.tail()

`DataFrames` contain rows and columns. If you want to understand the structure of your DataFrame, there are a few functions and attributes that might come in handy.

These include
* `shape`
* `columns`
* `index`
* `info()`
* `describe()`
* `len()`

In [None]:
iris.shape

The iris DataFrame contains 150 rows and 5 columns.

In [None]:
iris.columns

In [None]:
iris.index

In [None]:
iris.info()

As with lists and arrays, you can also use the function `len()` to see how many rows or elements our data set contains.

In [None]:
len(iris)

Another cool method is `.describe()`. It provides some basic statistics about each of the variables in your DataFrame, including measures of the central tendency, dispersion, and shape of a dataset's distribution, excluding **NaN** values.
* By default, it returns summary statistics for the numeric columns only, but it can also work with mixed data. If the method is called on strings, it returns measures such as the count, the number of unique values, and the most frequent value.

In [None]:
iris.describe()

### 3.2 Indexing &  Slicing  <a id='subsection3.2'></a>

#### .loc[rows-label(s),columns-label(s)]
`.loc` helps us view and index our DataFrame.
* It works with labels. Columns usually have descriptive names, while rows are often labeled with numbers, so the row labels are typically integers.
* It can take
    * one label __(df.loc[row-label, 'col-label'])__
    * a list of labels __(df.loc[[row-label-1, row-label-2, row-label-4], ['col-label-1', 'col-label-2', 'col-label-4']])__
    * or a _slice_ of labels __(df.loc[row-label-50 : row-label-100, 'col-label-1' : 'col-label-8'])__


#### Rows

Let's use `loc` to see the values in row 10 of our DataFrame.

In [None]:
iris.loc[10]

* Notice that if our rows were labeled with textual information, we would have to use that name instead of "10". In this case the row labels are numbers that start at 0, so the label 10 refers to the eleventh row.

What if we want to see the values in rows 5, 10, and 15? Let's pass 5, 10, and 15 into `loc` as a list of values.


In [None]:
iris.loc[[5,10,15]]

This returned a `DataFrame`, whereas the first example returned a `Series`. That is because this time we passed a list of labels rather than a single label.

How would you use `loc` to see the values of rows 10 through 20? Yes, you could use a list as in the example above, but typing every number from 10 to 20 is cumbersome. A better way is slicing, much as we did with arrays and lists. One difference: a `loc` slice includes the end label, so the slice below returns row 20 as well.

In [None]:
iris.loc[10 : 20]

#### Columns 

Great! Now that you know how to index rows, let's see how we can index columns. Don't forget that we are still using `loc`, so we will have to use column labels.

Let's begin by indexing by one column, variety.

In [None]:
iris.loc[: ,'variety']

Another way to index by only one column is by adding the column label in a list. 

In [None]:
iris.loc[: ,['variety']]

The difference between these two is that the first returned a `Series` because we selected only a single label, and the second returned an n x 1 `DataFrame` because we passed a list.

Notice that here we had to specify the range of rows that we want to index that column by. We used `:` in order to return all values in the column.

Now, let's index by more than one column. Just as before we will use a list containing our desired column labels. 

In [None]:
iris.loc[:,['sepal.length', 'sepal.width','variety']]

Just as we sliced rows, we can do the same with columns.

In [None]:
iris.loc[:,'sepal.length': 'petal.width']

#### .iloc[rows_index,columns_index]

Another way to index is using `.iloc`. `iloc` allows us to index using integer positions. 


#### Rows

In [None]:
iris.iloc[[1,3,6,8,9]]

Recall __start:stop:step__ from lists? We can also select a range of rows with a specified step in our DataFrame. Here we take every 5th row, starting at position 50 and stopping before position 150.

In [None]:
iris.iloc[50:150:5]

#### Columns

As we mentioned before, `iloc` works just like `loc`, but instead of labels we use integer positions. Let's get all the rows in the fifth column. Don't forget that we start counting at position 0.

In [None]:
iris.iloc[:,4]
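One difference between the two indexers is easy to miss: a `loc` slice includes its end label, while an `iloc` slice stops before its end position, just like a list. The sketch below uses a hypothetical DataFrame called `pets` with text row labels so that labels and positions are easy to tell apart.

```python
import pandas as pd

# Hypothetical DataFrame with text row labels.
pets = pd.DataFrame(
    {"legs": [4, 2, 4, 0]},
    index=["cat", "bird", "dog", "fish"],
)

by_label = pets.loc["bird":"dog"]   # label slice: includes both "bird" and "dog"
by_position = pets.iloc[1:3]        # position slice: rows 1 and 2, the stop (3) is excluded
by_position_short = pets.iloc[1:2]  # only row 1 ("bird")

print(by_label)
print(by_position)
print(by_position_short)
```

To get the same rows from both indexers, the `iloc` stop has to be one past the last row you want.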

### 3.3 Manipulating Columns  <a id='subsection3.3'></a>

Let's load in a new data set called "cereal". We will use `pd.read_csv` just as we did before.

In [None]:
cereal = pd.read_csv('data/cereal.csv')

In [None]:
cereal

####  Uniqueness

Suppose that we want to find out the number of unique manufacturers in our data. The `.unique()` method allows us to check this. 

There are two ways to accomplish this: one uses "dot" notation, and the other uses brackets. For the most part, we will stick to the second method, as it is easy to run into errors with dot notation.

1) __df.column_label.unique()__

2) __df['column_label'].unique()__



In [None]:
print('There are ',cereal['mfr'].nunique() ,'unique manufacturers')
print('These are: ', cereal['mfr'].unique())

Notice that we used the method `.nunique()` to tell us how _many_ unique items we have rather than which items. An alternative way to compute this is __len(cereal['mfr'].unique())__.
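The relationship between `.unique()` and `.nunique()` can be checked on a small example. This is only a sketch; `makers` is a hypothetical column of manufacturer codes, not the real cereal data.

```python
import pandas as pd

# Hypothetical column of manufacturer codes.
makers = pd.Series(["K", "G", "K", "N", "G", "K"])

unique_values = makers.unique()     # array of the distinct values
count_direct = makers.nunique()     # number of distinct values
count_via_len = len(unique_values)  # same number, computed from the array

print(unique_values, count_direct, count_via_len)
```

The two counts agree here because nothing is missing. By default `.nunique()` ignores missing values while `.unique()` keeps them, so the counts can differ by one when a column contains NaN.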

#### Frequencies

More specifically, say we want to know how many cereals exist per manufacturer. In this case, we would use the `.value_counts()` method instead. This method returns the counts for the unique values in our column.

In [None]:
cereal['mfr'].value_counts()

Notice that this method sorts our values in decreasing order. What if you wanted a different sorting? Maybe you want to sort by index, that is, in alphabetical order. In this case you would use the `sort_index()` method as seen below.

#### Sorting

In [None]:
cereal['mfr'].value_counts().sort_index()

If instead you wanted to sort by counts, but in ascending order, you can use the `.sort_values()` method instead with the argument __ascending = True__.

In [None]:
cereal['mfr'].value_counts().sort_values(ascending=True)

#### Min, Max, & Range

Say that for our analysis we want to understand our cereals by the rating feature.

A good starting point might be to see what the __min__ and the __max__ are for our data. We can do this by using the methods `.min()` and `.max()`, respectively.

In [None]:
print('Min rating is :', cereal['rating'].min())

print('Max rating is :', cereal['rating'].max())

To get the range, all you need to do is subtract the min from the max!

Tip: Create a variable for the max and the min so that you don't have to spend time rewriting your code! If you don't remember how to do this, go back to Lesson 1.

Bonus: Use the function `round` to round these two numbers to a number of decimal places of your choice.
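The tip and the bonus above could be combined along these lines. This is a sketch under assumptions: the values in `ratings` are made up rather than taken from the cereal data, and rounding to 2 decimal places is an arbitrary choice.

```python
import pandas as pd

# Hypothetical ratings; the real values live in cereal['rating'].
ratings = pd.Series([68.402973, 33.983679, 59.425505, 93.704912])

# Store the min and max once so they can be reused.
rating_min = ratings.min()
rating_max = ratings.max()
rating_range = rating_max - rating_min

# round(value, 2) keeps two decimal places (an arbitrary choice here).
print(round(rating_min, 2), round(rating_max, 2), round(rating_range, 2))
```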

#### Missing Values

A common problem that you will come across when analyzing data is __missing__ data. You can check whether your data set contains missing values by using the method `.isnull()`. It returns True wherever a value is missing and False wherever it is not. We can combine it with `.sum()` to count the True values in each column.

**Note:** In Python (as in many programming languages), True is treated as 1 and False as 0, so `sum` can add up True/False values as if they were numbers.

In [None]:
cereal.isnull().sum()

Notice that in the example above we checked for the number of missing values in each of the columns. What if you only wanted to do it for one? You can use the same notations we discussed earlier, that is, bracket and dot notation.

In [None]:
cereal['rating'].isnull().sum()
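To see why `.isnull().sum()` counts missing values, it can help to look at the intermediate True/False table. The sketch below uses a small hypothetical DataFrame named `toy` with two missing entries; it is not the cereal data.

```python
import numpy as np
import pandas as pd

# Hypothetical table with two missing entries in one column.
toy = pd.DataFrame({
    "calories": [110, np.nan, 70, np.nan],
    "rating": [68.4, 34.0, 59.4, 93.7],
})

missing_mask = toy.isnull()              # True wherever a value is missing
missing_per_column = missing_mask.sum()  # each True counts as 1

print(missing_mask)
print(missing_per_column)
```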

#### Groupby 

Now, say that we want to find the average number of calories for the cereals of each manufacturer. We can use an operation called `.groupby()`. `.groupby()` involves a combination of splitting an object (a series or column), applying a function (for example `.sum()`, `.mean()`, or `.count()`), and combining the results.


In [None]:
cereal.groupby("mfr")['calories'].mean()

Let's break down what happened above. We begin with a DataFrame (`cereal`) and tell pandas (our library) to group by a column (`mfr`). Then we specify which column (`calories`) we want to apply our desired operation (`mean`) to.
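The split, apply, and combine steps can be reproduced by hand on a tiny example. In this sketch, `toy` is a hypothetical DataFrame that borrows the `mfr` and `calories` column names; the values are made up.

```python
import pandas as pd

# Hypothetical data borrowing two column names from the cereal data set.
toy = pd.DataFrame({
    "mfr": ["K", "G", "K", "G"],
    "calories": [100, 110, 120, 90],
})

# Split and apply by hand for one group: (100 + 120) / 2 = 110.0
k_mean = toy[toy["mfr"] == "K"]["calories"].mean()

# groupby splits, applies, and combines for every group at once.
means = toy.groupby("mfr")["calories"].mean()

print(k_mean)
print(means)
```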

In [None]:
cereal.head()

You can also group by more than one column! You just need to add the columns in a list. 

For instance, let's get the number of cereals by type and  manufacturer.

In [None]:
cereal.groupby(["type", "mfr"])['name'].count()

The groupby above returned a `Series`. If you would rather get a `DataFrame`, put an additional pair of brackets around the column you are calling the action on, in this case `name`.

In [None]:
cereal.groupby(["type", "mfr"])[['name']].count()

### 3.4 Boolean Indexing <a id='subsection3.4'></a>

Suppose we only want to look at the cereals that have fewer than 100 calories. We will use __boolean indexing__ to create a DataFrame that meets this criterion.

We will accomplish this by:
1. Selecting the `calories` column from the DataFrame.
2. Creating an array (or list) of Booleans where each value is True if and only if the value in `calories` is less than 100, and False otherwise. To do this, apply a comparison operator such as <, >, <=, >=, ==, or != to the column.
3. Using the array of True/False values to keep only the rows of our DataFrame that correspond to True.

In [None]:
cereal['calories'] < 100

In [None]:
tf_array = cereal['calories'] < 100

In [None]:
cereal[tf_array]

In [None]:
cereal[cereal['calories'] < 100]

In [None]:
cereal[(cereal['calories'] < 100) & (cereal['calories'] > 70)]
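Conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons such as `<`. A minimal sketch on a hypothetical DataFrame named `toy`:

```python
import pandas as pd

# Hypothetical data used only for this illustration.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "calories": [50, 90, 110, 160],
})

between = toy[(toy["calories"] > 70) & (toy["calories"] < 100)]    # both conditions hold
outside = toy[(toy["calories"] <= 70) | (toy["calories"] >= 100)]  # at least one holds
not_low = toy[~(toy["calories"] < 100)]                            # condition does not hold

print(between)
print(outside)
print(not_low)
```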

In [None]:
# EXERCISE
# Select only the rows that have more than 200 calories from cereal
...

# 4. Plotting <a id='section4'></a>

The most commonly used Python plotting library is _matplotlib_. It is robust and has very rich functionality but also some basic methods which are easy to use via simple commands.


In [None]:
import matplotlib.pyplot as plt

But there are many other plotting libraries (many of them built on top of matplotlib) which offer richer plotting functionality or simplified syntax for advanced commands. Let's use the _seaborn_ library to look at some of those as well.


In [None]:
import seaborn as sns 

The most basic and probably the most heavily used plot is always the scatterplot. Here is how to create a scatterplot with just one line of matplotlib code.

In [None]:
plt.scatter(cereal['calories'], cereal['sugars'])

In [None]:
plt.scatter(cereal['calories'], cereal['rating'])

If we want to get a sense of the distribution of values in one of our columns, we can take a look at the histogram for that column like so.

In [None]:
plt.hist(cereal['calories'])

Let's now use _seaborn_ to compare the distribution of our data grouped by manufacturer. Can you do this with _matplotlib_? Sure, but you'll have to do a bunch of manual sorting. It is much easier to use the _seaborn_ command that does all the sorting for us.

In [None]:
sns.boxplot(data=cereal, x='mfr', y='calories')

Another extremely common type of plot is the humble line plot. Let's use it to look at the relationship between calorie count and rating in our cereal data.

In [None]:
plt.plot(cereal['calories'], cereal['rating'])

Looks like our _rating_ values are all mixed up. We better do some sorting before we can get a useful line plot.

In [None]:
sorted_cereal = cereal.sort_values('rating')

In [None]:
plt.plot(sorted_cereal['rating'], sorted_cereal['calories'])

There we have it, clear as day: the _lower_ the calorie count, the higher the rating.

But what if we wanted to fit a quick regression line? Can we do that? Sure, but we'd better use _seaborn_ for that instead of _matplotlib_. _Seaborn_ has this capability built in, while with _matplotlib_ we would have to fit the regression line manually using other libraries.

In [None]:
sns.lmplot(x ='rating', y ='calories', data = sorted_cereal)
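For comparison, here is a rough sketch of what fitting a line manually could look like. It uses `np.polyfit` from numpy, which is one possible choice and not necessarily how _seaborn_ does it for every option. The data points are made up so that the answer is easy to check.

```python
import numpy as np

# Hypothetical points that lie exactly on the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# deg=1 asks for a straight line; polyfit returns the slope first.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept

print(slope, intercept)

# With matplotlib imported as plt, the points and line could then be drawn with:
# plt.scatter(x, y); plt.plot(x, fitted)
```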

_Seaborn_ can even fit non-linear regression lines! The `order` argument sets the degree of the polynomial that is fitted.

In [None]:
sns.lmplot(x ='rating', y ='calories', data = sorted_cereal, order=2)

In [None]:
sns.lmplot(x ='rating', y ='calories', data = sorted_cereal, order=3)

Notebook developed by: Kseniya Usovich, Karla Palos, and Anton Bosneaga

Cal NERDS GitHub: https://github.com/Cal-NERDS