# Introduction to Data Science

This notebook walks through both concepts and libraries that will be important for the intro to data science project. 

We will cover: 
- using libraries: 
    - numpy, pandas, matplotlib, collections, tqdm, sklearn
- list comprehensions, enumerate, 2D list slicing 
- linear regression


## What is a Library? 

A library is a collection of functions, usually written by awesome people in the Python community. 
Although some libraries can seem big and complicated, they actually are easy to use and create. 
We’re going to quickly show you how you can make your own library, 
and then we’ll show you how to download and use popular python libraries.

#### note: this has to be saved as a separate .py file (named mylibrary.py, to match the imports further down). The code below is for copy-pasting :)

In [26]:
def sum_num(x: float, y: float) -> float:
    ''' Simple function that sums two numbers x and y '''
    return x + y


def diff_num(x: float, y: float) -> float: 
    ''' Simple function that subtracts y from x '''
    return x - y

def count(x: int) -> None:
    ''' Sequential counting '''
    
    print('Let\'s count together!')
    
    for i in range(x):
        print(f'Step: {i}')

    return

Now we have our own library saved as a python file... but how do we use these functions? We will have to tell python to import any libraries into our existing code. Let's look at how we can use the import statement. 

In [21]:
# this first example imports *everything* from the library 
# (note for my OOP friends; this includes private functions -> use from mylibrary import * to import public only)

import mylibrary

x = 10
y = 11 

a = mylibrary.sum_num(x, y)
b = mylibrary.diff_num(x, y)

print(f'sum: {a} \ndifference: {b}')

sum: 21 
difference: -1


In [25]:
# what if you don't want to write a long name like mylibrary every time? You can alias the library name

import mylibrary as ml

x = 10
y = 11 

a = ml.sum_num(x, y)
b = ml.diff_num(x, y)

print(f'sum: {a} \ndifference: {b}')

sum: 21 
difference: -1


In [24]:
# We can import only the functions we need by using the from syntax.
# Python still loads the whole file, but the imported name can be used without the library prefix.

from mylibrary import sum_num

x = 10 
y = 11

a = sum_num(x, y)

print(f'sum: {a}')

sum: 21


Great! Now we see that importing libraries just means you're importing functions stored in other files. You now know how to use the import statement to import entire libraries, or just functions that you want from the libraries. We will be introducing 3 very useful (and commonly used) libraries in the next section that will help us analyze our data quickly and efficiently.
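
To recap the three import styles side by side, here is a minimal sketch. It assumes Python's built-in math module as a stand-in for mylibrary, so it runs even without the mylibrary.py file; the variable names are only for illustration.

```python
# Three ways to import, shown with Python's built-in math module
# (math stands in for mylibrary here, which is an assumption of this sketch).
import math                  # whole module, accessed through the module name
import math as m             # same module under a shorter alias
from math import sqrt        # a single name, usable without a prefix

root_full = math.sqrt(16)
root_alias = m.sqrt(16)
root_direct = sqrt(16)

print(root_full, root_alias, root_direct)  # all three call the same function
print(m is math)                           # the alias points to the same module object
```

Whichever style you pick, the same function runs; the only difference is the name you type to reach it.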

####  Using Numpy

Let’s start with an example of a one-dimensional data set -- the ages of a group of friends. We can represent this as a list:

In [41]:
ages = [17, 17, 18, 19, 16, 20]

[17, 17, 18, 19, 16, 20]


By now, we know how lists work. Although they're nice, they can be very inefficient, especially when the lists get very long. Numpy (NUMerical PYthon) gives us a way to represent these lists as an array. You can think of it as casting one type to another (e.g., int('5'))

In [38]:
import numpy as np 
from pprint import pprint

# np.array always builds a new array (a copy of the data)
ages_np_copy = np.array(ages)

# np.asarray skips the copy when its input is already a numpy array;
# a plain list like ages still has to be converted into a new array
ages_np_inplace = np.asarray(ages)
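
The difference between np.array and np.asarray only shows up when the input is already a numpy array. The sketch below is one way to check this for yourself; the variable names are made up for the illustration.

```python
import numpy as np

ages = [17, 17, 18, 19, 16, 20]

# Starting from a list, numpy has to build a brand-new array either way.
from_list = np.asarray(ages)
from_list[0] = 99
print(ages[0])  # the original list is unchanged

# Starting from an existing array, the two functions behave differently.
ages_np = np.array(ages)
same_object = np.asarray(ages_np)   # no copy: hands back the same array
new_copy = np.array(ages_np)        # copy: a separate array

print(same_object is ages_np)
print(new_copy is ages_np)
```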

Remember how we addressed the efficiency of a numpy array vs a list? This comes in handy when we want to work with multi-dimensional arrays. 

Say we want to know the height of everyone in this group and see if height is correlated to age. We would have to make two separate lists with regular python, but we can make a 2-dimensional array, or data table, with numpy:


##### note: we use pprint (a.k.a. "pretty-print") because it shows us that we're dealing with an array now and prints the data in a prettier way

In [43]:
heights = [165, 170, 169, 150, 167, 171]
friends = np.asarray([ages, heights])
pprint(friends)

array([[ 17,  17,  18,  19,  16,  20],
       [165, 170, 169, 150, 167, 171]])


We can see that our ‘friends’ array is structured as a table with two rows and six columns, where the first row is the age of each friend and the second row is the height (in centimetres) of each friend. 

What if we wanted to change the view of this array? It makes more sense for each friend to be a row, with age and height as the two columns. We can use the transpose function to turn the columns into rows and the rows into columns. Let's see what that looks like:

In [40]:
friends_transpose = friends.transpose()
pprint(friends_transpose)

array([[ 17, 165],
       [ 17, 170],
       [ 18, 169],
       [ 19, 150],
       [ 16, 167],
       [ 20, 171]])


Now the first column is for age and the second column is for height, and each row is a friend. This is usually how we work with data: each variable gets its own column, and each observation gets its own row.

Numpy lets us calculate statistics about our data, such as finding the mean age and height in the friend group. To do this, we have to tell numpy the direction in which to calculate, using the "axis" argument. With axis=0, numpy works down the rows and returns one value per column; with axis=1, it works across the columns and returns one value per row. 

We want to find the average age and the average height. If you did this by hand, you would sum each column separately and divide by the number of _rows_, which is what axis=0 does.

In [47]:
np.mean(friends_transpose, axis=0)

array([ 17.83333333, 165.33333333])
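
One way to remember the axis argument is that it names the direction that gets collapsed. Here is a small sketch on a made-up 2x2 table (not the friends data) where the numbers are easy to check by hand:

```python
import numpy as np

# Two observations (rows) of two variables (columns).
table = np.array([[1, 2],
                  [3, 4]])

col_means = np.mean(table, axis=0)  # collapse the rows: one mean per column
row_means = np.mean(table, axis=1)  # collapse the columns: one mean per row

print(col_means)  # [2. 3.]
print(row_means)  # [1.5 3.5]
```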

Numpy is powerful for working with numbers, but it can handle other types of data, too. It would be useful to see which row refers to which friend without having to remember. Let’s fix that:

In [56]:
names = ['Emily', 'Shraddha', 'Anne', 'Jess', 'Riley', 'Nicola']
friends_names = np.asarray([names, ages, heights]).transpose()

pprint(friends_names)

array([['Emily', '17', '165'],
       ['Shraddha', '17', '170'],
       ['Anne', '18', '169'],
       ['Jess', '19', '150'],
       ['Riley', '16', '167'],
       ['Nicola', '20', '171']], dtype='<U8')


Now that we have names in our data, we can’t use ‘np.mean’ on all of the columns, since the first column isn’t full of numbers:

In [57]:
np.mean(friends_names, axis=0)

TypeError: cannot perform reduce with flexible type

Recall that with Python lists, we can use "slicing" to access elements with [start:stop] notation. 
Since we're working with two-dimensional arrays now, we need to extend our slicing notation to the following: 

[row_start:row_stop, col_start:col_stop]

Remember that python starts counting at 0 instead of 1. We want to include every row, so we use ‘:’ before the comma, which means "start at the first row and go through the last row". After the comma, we give a list of the columns we want:

In [70]:
# we include the .astype(int) to make sure our slices contain the correct dtype
np.mean(friends_names[:,[1,2]].astype(int), axis=0)

array([ 17.83333333, 165.33333333])
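
To get more comfortable with [row_start:row_stop, col_start:col_stop], it can help to slice a small array whose values show their own position. The grid below is a made-up example, not part of the friends data:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# grid is:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

block = grid[0:2, 1:3]       # rows 0-1, columns 1-2 (the stop index is excluded)
first_col = grid[:, 0]       # every row, column 0
picked = grid[:, [1, 3]]     # every row, columns 1 and 3 (a list of columns)

print(block)
print(first_col)
print(picked)
```

Note that a range like 1:3 stops before index 3, while a list like [1, 3] picks exactly the columns it names.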

Numpy has a long list of functions other than just mean. If you look at the documentation (https://numpy.org/doc/) you'll find *all* of numpy's functions :-) 

#### Using Pandas

Pandas is another library, built on top of numpy. Importantly, pandas lets you organize and work with data that has mixed types like strings, booleans, and numbers, because each column can keep its own type. Remember how we had to cast the friends_names array earlier? That was because a numpy array stores a single type, so our numbers were turned into strings. Let’s load our data into pandas:

In [78]:
import pandas as pd

# we can specify a list of column names we want, just to keep us organized
friends_pd = pd.DataFrame(friends_names, columns=['Names', 'Ages', 'Heights'])
pprint(friends_pd)

      Names Ages Heights
0     Emily   17     165
1  Shraddha   17     170
2      Anne   18     169
3      Jess   19     150
4     Riley   16     167
5    Nicola   20     171


Cool! Notice how pandas was automatically able to figure out the columns and the rows? This looks a lot tidier than it did with numpy! Pandas works well with long-form data, where each column is a variable and each row is an observation. 

With pandas dataframes each column is considered a separate series, which is similar to a python list. You can think about a pandas dataframe as a collection of lists.

One of the biggest advantages of using pandas instead of a bunch of numpy arrays is that the library comes with a lot of really fast and useful functions pre-built. When you have a lot of data to work with, the speed and efficiency of your functions become very important.

Let's look at accessing one column and one row: 

In [97]:
# iloc -> "integer location", refers to the first (0th) row
pprint(friends_pd.iloc[0])
print('\n')

# not using iloc, you're indexing a column
pprint(friends_pd['Heights'])
print('\n')

# the values are still strings here (dtype: object) because they came from friends_names, so convert to int first
pprint(np.mean(friends_pd['Heights'].astype(int)))


Names      Emily
Ages          17
Heights      165
Name: 0, dtype: object


0    165
1    170
2    169
3    150
4    167
5    171
Name: Heights, dtype: object


165.33333333333334
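
The heights above show up as dtype: object because friends_pd was built from the all-string friends_names array. As a sketch of an alternative, assuming we build the dataframe straight from the original python lists (friends_typed is a name made up for this example), each column keeps its own type:

```python
import pandas as pd

names = ['Emily', 'Shraddha', 'Anne', 'Jess', 'Riley', 'Nicola']
ages = [17, 17, 18, 19, 16, 20]
heights = [165, 170, 169, 150, 167, 171]

# Each dictionary key becomes a column, and each column keeps its own dtype.
friends_typed = pd.DataFrame({'Names': names, 'Ages': ages, 'Heights': heights})

print(friends_typed.dtypes)
mean_height = friends_typed['Heights'].mean()  # numbers stayed numbers, so no conversion step
print(mean_height)
```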


Let’s work with some other data. Download this file pets.csv and save it to the computer desktop (you can download and read datasets from anywhere on your computer, but for today let’s use the desktop folder). The dataset is stored as a ‘.csv’ file, which stands for ‘comma-separated values’. This means commas are used to separate the columns. The first few rows look like this:

In [98]:
PATH = '/Users/chantal/Desktop/HERCC_ds/menu_cost_prediction/pets.csv'
pets = pd.read_csv(PATH)
pprint(pets)

  Animal           Name  Age  Weight  Color
0    Dog          Angel   12      99  black
1    Dog          Matty    3      15  brown
2    Dog  Michaelangelo    9     120  white
3    Cat        Cupcake   15      10   grey
4    Cat          Louis    6      12  brown


We use special commands that let the computer know that the columns are separated by commas, so when we load it into python, the commas are removed and it looks nice and clean.

Now everything is nice and lined up, but there are numbers on the left that weren’t in the file. This ‘index’ is added by pandas to make it easier to see what row you’re working with. Notice how the column names (Animal, Name, Age, Weight, Color) do not have a number. This is because the default for pd.read_csv is to assume that the very first row in the file contains the column names. If we had a dataset that didn’t have column names, we would have to specify this:


In [102]:
pets_no_names = pd.read_csv(PATH, header=None)
pprint(pets_no_names)

        0              1    2       3      4
0  Animal           Name  Age  Weight  Color
1     Dog          Angel   12      99  black
2     Dog          Matty    3      15  brown
3     Dog  Michaelangelo    9     120  white
4     Cat        Cupcake   15      10   grey
5     Cat          Louis    6      12  brown


Now the column names have become row 0 of the data, and each column is labeled with a number instead. If we tried to do analysis on our data like this, we would probably get a lot of error messages because the columns are now a mix of numbers and characters. It’s important to check your data after you load it to make sure it looks ok. If you know a bit about your data beforehand, you can check the shape of it to see if it matches what you expect (or if you don’t know what to expect, you can find out!):

In [103]:
pets.shape

(5, 5)

The first value is the number of rows and the second value is the number of columns.
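
If you want to experiment with the header option without downloading anything, one approach is to read a CSV from a string instead of a file. This is only a sketch: csv_text is a shortened, hand-typed stand-in for pets.csv, and io.StringIO makes the string behave like a file.

```python
import io
import pandas as pd

# A small CSV held in a string, so no file on disk is needed.
csv_text = "Animal,Name,Age\nDog,Angel,12\nCat,Louis,6\n"

with_header = pd.read_csv(io.StringIO(csv_text))              # first line -> column names
no_header = pd.read_csv(io.StringIO(csv_text), header=None)   # first line -> a data row

print(with_header.shape)  # (2, 3)
print(no_header.shape)    # (3, 3)
print(no_header.iloc[0, 2])  # the header text ends up inside the data
```

Comparing the two shapes is a quick way to spot a header row that was read as data.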

It might be easier to think about our data if we label the rows with the names of the pets instead of numbers. We can do this with any column, but it’s best to use columns where each value is unique. That way we can specify exactly which row we are interested in. Let’s change the row indexing:


In [106]:
pets_named = pets.set_index('Name')
pprint(pets_named)

              Animal  Age  Weight  Color
Name                                    
Angel            Dog   12      99  black
Matty            Dog    3      15  brown
Michaelangelo    Dog    9     120  white
Cupcake          Cat   15      10   grey
Louis            Cat    6      12  brown


The Name column has been removed, and now the rows are labeled with the appropriate names. If we wanted to look at just the details about Matty, we can do this like so:

In [108]:
pets_named.loc['Matty', :]

Animal      Dog
Age           3
Weight       15
Color     brown
Name: Matty, dtype: object

The ‘loc’ function selects the data at the specified location, so in this case, at the row named Matty and all of the columns (indicated by ‘:’). We can grab a column of data in a similar fashion:

In [109]:
pets_named.loc[:, 'Weight']

Name
Angel             99
Matty             15
Michaelangelo    120
Cupcake           10
Louis             12
Name: Weight, dtype: int64

This gave us the values of the weight of each pet. If we wanted to be more specific and say we’re only interested in finding out the weight of Matty, we could do that like this:


In [110]:
pets_named.loc['Matty', 'Weight']

15

If we didn’t name the rows or columns of our data, we can use a similar function ‘iloc’ that uses the numbered indexes. First, let's double-check the row and column numbers of where Matty’s weight is stored:

In [112]:
print(pets_no_names)

# We want row number 2 and column number 3:

pets_no_names.iloc[2, 3]


        0              1    2       3      4
0  Animal           Name  Age  Weight  Color
1     Dog          Angel   12      99  black
2     Dog          Matty    3      15  brown
3     Dog  Michaelangelo    9     120  white
4     Cat        Cupcake   15      10   grey
5     Cat          Louis    6      12  brown


'15'
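
To see loc and iloc reach the same cell, here is a minimal sketch on a cut-down table (pets_small is a made-up name, and only three of the pets are included):

```python
import pandas as pd

pets_small = pd.DataFrame({
    'Name': ['Angel', 'Matty', 'Louis'],
    'Animal': ['Dog', 'Dog', 'Cat'],
    'Weight': [99, 15, 12],
}).set_index('Name')

by_label = pets_small.loc['Matty', 'Weight']   # row label, column label
by_position = pets_small.iloc[1, 1]            # row position 1, column position 1

print(by_label, by_position)
```

Because Name became the index, it no longer counts as a column, so Weight sits at column position 1 here.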

Let’s add some new columns to our dataset. We will make a list of which family each pet belongs to and attach it to our existing dataframe by assigning it to a new column name:

In [113]:
pets_named['Family'] = ['Smith', 'Liu', 'Douglas', 'Khan']

ValueError: Length of values does not match length of index

Whoops, we got an error! When we add new rows or columns, they have to match the dimensions of the dataframe. We forgot to include Louis’s family!


In [116]:
pets_named['Family'] = ['Smith', 'Liu', 'Douglas', 'Khan', 'Garcia']

That’s better. Let’s also add how long each family has had their pet:

In [117]:
years_owned = [10, 3, 8, 2, 4]
pets_named['Years_owned'] = years_owned

We can also create new columns based on information in existing columns. We can add a column for how old each pet was when it was adopted by each family by subtracting the Years_owned column from the Age column using the ‘assign’ function:


In [118]:
pets_updated = pets_named.assign(Adoption_age = pets_named['Age'] - pets_named['Years_owned'])

If we no longer need a column and want to keep the size of our dataframe nice and manageable, we can remove columns with the ‘drop’ function:


In [119]:
pets_less = pets_updated.drop(columns='Color')
pprint(pets_less)


              Animal  Age  Weight   Family  Years_owned  Adoption_age
Name                                                                 
Angel            Dog   12      99    Smith           10             2
Matty            Dog    3      15      Liu            3             0
Michaelangelo    Dog    9     120  Douglas            8             1
Cupcake          Cat   15      10     Khan            2            13
Louis            Cat    6      12   Garcia            4             2
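
One detail worth noticing is that both assign and drop hand back a new dataframe and leave the one they were called on untouched, which is why the results above were saved under new names. A small sketch with a made-up two-pet table (pets_demo is a hypothetical name):

```python
import pandas as pd

pets_demo = pd.DataFrame({'Age': [12, 3], 'Years_owned': [10, 3]},
                         index=['Angel', 'Matty'])

with_adoption = pets_demo.assign(Adoption_age=pets_demo['Age'] - pets_demo['Years_owned'])
without_age = with_adoption.drop(columns='Age')

print(list(pets_demo.columns))       # the original is unchanged
print(list(with_adoption.columns))   # one extra column
print(list(without_age.columns))     # Age removed
```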


#### Using matplotlib

#### Other useful Libraries: collections, tqdm, Scikit-learn

Before we continue, there are a few other cool libraries we can make use of in our data science projects. I'll demonstrate them briefly here:

In [156]:
# collections provides specialized containers; a deque removes items from the front much faster than a list
import timeit

# deque structure: popleft() removes the first item
print(timeit.timeit(stmt='d.popleft()',
                    setup='from collections import deque; d = deque(range(100000))',
                    number=99999))

# list structure: pop(0) removes the first item and shifts every remaining item
print(timeit.timeit(stmt='l.pop(0)', setup='l = list(range(100000))', number=99999))

0.011967551000452659
1.4199160369998935
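
The timing gap comes from how the two structures handle their left end: a list has to shift every remaining item after pop(0), while a deque is designed for fast adds and removes at both ends. Here is a short sketch of the deque methods involved (the queue values are made up):

```python
from collections import deque

queue = deque([1, 2, 3])
queue.appendleft(0)        # add to the left end
queue.append(4)            # add to the right end

first = queue.popleft()    # remove from the left end
last = queue.pop()         # remove from the right end

print(first, last, list(queue))
```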


In [127]:
# tqdm is a cool progress bar wrapper around your for-loops 
from tqdm import tqdm

for i in tqdm(range(1000000)):
    continue


  0%|          | 0/1000000 [00:00<?, ?it/s]
 15%|█▍        | 146023/1000000 [00:00<00:00, 1460213.99it/s]
 29%|██▊       | 285867/1000000 [00:00<00:00, 1441112.42it/s]
 48%|████▊     | 477695/1000000 [00:00<00:00, 1557326.28it/s]
 67%|██████▋   | 667380/1000000 [00:00<00:00, 1645693.29it/s]
 86%|████████▋ | 864830/1000000 [00:00<00:00, 1732231.44it/s]

Scikit-Learn is a popular library for many machine learning and data science applications. We'll cover this library in depth in the Linear Regression section, but for now, you can take a look at the documentation page: 

https://scikit-learn.org/stable/

## Working with Lists (... Beyond the Basics)

#### NOTE: cover string.split() in the refresher section!!!!! and working with strings in general, maybe the reading and writing to files as well???

List comprehensions are a cool, pythonic way of quickly manipulating lists in python. They basically take a for loop and condense it into one line. Let's take a look at how to jump between a for loop and a list comprehension:

In [166]:

''' general structure for list comprehensions: [*modify x* for *each x* in my_list] '''

my_list_loop = [1, 2, 3, 4, 5, 6, 7]
my_list_comp = [1, 2, 3, 4, 5, 6, 7]

# let's add one to each element with a for loop: 

# recall that enumerate returns the index AND the value in the list
for index, num in enumerate(my_list_loop):
    my_list_loop[index] = num + 1

pprint(my_list_loop)

# now let's do it using a list comprehension
my_list_comp = [num + 1 for num in my_list_comp]

pprint(my_list_comp)


[2, 3, 4, 5, 6, 7, 8]
[2, 3, 4, 5, 6, 7, 8]
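
List comprehensions can also filter while they transform, by adding an if at the end. The sketch below puts the loop version and the comprehension version next to each other; the list and variable names are made up for the illustration.

```python
numbers = [1, 2, 3, 4, 5, 6, 7]

# Loop version: keep only the even numbers, doubled.
doubled_evens_loop = []
for num in numbers:
    if num % 2 == 0:
        doubled_evens_loop.append(num * 2)

# Comprehension version: [*modify x* for *each x* in my_list if *condition*]
doubled_evens_comp = [num * 2 for num in numbers if num % 2 == 0]

# enumerate also works inside a comprehension.
indexed = [(index, num) for index, num in enumerate(numbers[:3])]

print(doubled_evens_loop)
print(doubled_evens_comp)
print(indexed)
```

Unlike the in-place loop with enumerate, a comprehension always builds a new list, so the original numbers list is left as it was.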


## Data Science Basics: Introducing Linear Regression