## Course Description
Python is a general-purpose programming language that is becoming ever more popular for data science. Companies worldwide are using Python to harvest insights from their data and gain a competitive edge. Unlike other Python tutorials, this course focuses on Python specifically for data science. In our Introduction to Python course, you’ll learn about powerful ways to store and manipulate data, and helpful data science tools to begin conducting your own analyses.
### Credit: 
* Filip Schouwenaars
* Patrick Varilly
* Vincent Vankrunkelsven
* Hugo Bowne-Anderson

## Module 1. Python Basics
An introduction to the basic concepts of Python. Learn how to use Python interactively and by using a script. Create your first variables and acquaint yourself with Python's basic data types.

### Python Features
1. **General purpose**: build anything
2. **Open source!** Free!
3. Python packages, also for data science
  * Many applications and fields
  
Python is a versatile language. Some go-to applications of Python are listed below.
* You want to do some quick calculations.
* For your new business, you want to develop a database-driven website.
* Your boss asks you to clean and analyze the results of the latest satisfaction survey.

In [1]:
# Example, do not modify!
print(5 / 8)

# Print the sum of 7 and 10
print(7 + 10)

0.625
17


### Any comments?
To add comments to your Python script, you can use the # tag. These comments are not run as Python code, so they will not influence your result.

In [2]:
# Division
print(5 / 8)

# Addition
print(7 + 10)

0.625
17


### Python as a calculator
Python is perfectly suited to do basic calculations. Apart from addition, subtraction, multiplication and division, there is also support for more advanced operations such as:

1. **Exponentiation: \*\*.** This operator raises the number to its left to the power of the number to its right. For example, 4 \*\* 2 gives 16.
2. **Modulo: %.** This operator returns the remainder of the division of the number to the left by the number on its right. For example 18 % 7 equals 4.
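The two operators above can be tried directly. This is a minimal sketch; the variable names power and remainder are chosen here only for illustration.

```python
# Exponentiation: 4 raised to the power of 2
power = 4 ** 2
print(power)

# Modulo: remainder of dividing 18 by 7
remainder = 18 % 7
print(remainder)
```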

### Variables and Types
#### Variable
Variables are important for reproducibility
* Specific, case-sensitive name
* Call up value through variable name

#### Variable Assignment
In Python, a variable allows you to refer to a value with a name. To create a variable use =, like this example:

x = 5

You can now use the name of this variable, x, instead of the actual value, 5.

Remember, = in Python means assignment and is called the **assignment operator**; it doesn't test equality!

#### Python Types
1. Float
2. Integer
3. String
4. Boolean

In [3]:
# Create a variable savings
savings = 100

# Print out savings
print(savings)

100


In [4]:
# Create a variable savings
savings = 100

# Create a variable growth_multiplier
growth_multiplier = 1.1

# Calculate result
result = savings * (growth_multiplier ** 7)

# Print out result
print(result)

194.87171000000012


### Other variable types
In the previous exercise, you worked with two Python data types:

1. **int, or integer:** a number without a fractional part. savings, with the value 100, is an example of an integer.
2. **float, or floating point:** a number that has both an integer and fractional part, separated by a point. growth_multiplier, with the value 1.1, is an example of a float.

Besides numerical data types, there are two other very common data types:

3. **str, or string:** a type to represent text. You can use single or double quotes to build a string.
4. **bool, or boolean:** a type to represent logical values. Can only be True or False (the capitalization is important!)

In [5]:
# Create a variable desc
desc = "compound interest"

# Create a variable profitable
profitable = True

### Guess the type
To find out the type of a value or a variable that refers to that value, you can use the type() function. Suppose you've defined a variable a, but you forgot the type of this variable. To determine the type of a, simply execute:

**type(a)**

In [6]:
# Operations with other types
savings = 100
growth_multiplier = 1.1
desc = "compound interest"

# Assign product of growth_multiplier and savings to year1
year1 = savings * growth_multiplier

# Print the type of year1
print(type(year1))

# Assign sum of desc and desc to doubledesc
doubledesc = desc + desc


# Print out doubledesc
print(doubledesc)

<class 'float'>
compound interestcompound interest


Notice how desc + desc causes "compound interest" and "compound interest" to be pasted together.

In [8]:
# Type conversion

# Definition of savings and result
savings = 100
result = 100 * 1.10 ** 7

# Fix the printout
print("I started with $" + str(savings) + " and now have $" + str(result) + ". Awesome!")

# Definition of pi_string
pi_string = "3.1415926"

# Convert pi_string into float: pi_float
pi_float = float(pi_string)
print(pi_float)

I started with $100 and now have $194.87171000000012. Awesome!
3.1415926


## Module 2. Python Lists
Learn to store, access, and manipulate data in lists: the first step toward efficiently working with huge amounts of data.

### Create a list
As opposed to int, bool etc., a list is a compound data type; you can group values together:

a = "is"

b = "nice"

my_list = ["my", "list", a, b]

After measuring the height of your family, you decide to collect some information on the house you're living in. The areas of the different parts of your house are stored in separate variables for now, as shown in the script.

In [1]:
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# Create list areas
areas = [hall, kit, liv, bed, bath]

# Print areas
print(areas)


[11.25, 18.0, 20.0, 10.75, 9.5]


### Create list with different types
A list can contain any Python type. Although it's not really common, a list can also contain a mix of Python types including strings, floats, booleans, etc.

The printout of the previous exercise wasn't really satisfying. It's just a list of numbers representing the areas, but you can't tell which area corresponds to which part of your house.

In [2]:
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# Adapt list areas
areas = ["hallway", hall, "kitchen", kit, "living room", liv, "bedroom", bed, "bathroom", bath]

# Print areas
print(areas)

['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0, 'bedroom', 10.75, 'bathroom', 9.5]


In [3]:
print([1, 3, 4, 2])
print([[1, 2, 3], [4, 5, 7]])
print([1 + 2, "a" * 5, 3])

[1, 3, 4, 2]
[[1, 2, 3], [4, 5, 7]]
[3, 'aaaaa', 3]


### List of lists

In [4]:
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# house information as list of lists
house = [["hallway", hall],
         ["kitchen", kit],
         ["living room", liv],
         ["bedroom", bed],
         ["bathroom", bath]]

# Print out house
print(house)

# Print out the type of house
print(type(house))

[['hallway', 11.25], ['kitchen', 18.0], ['living room', 20.0], ['bedroom', 10.75], ['bathroom', 9.5]]
<class 'list'>


### Subset and conquer
Subsetting Python lists is a piece of cake. Take the code sample below, which creates a list x and then selects "b" from it. Remember that this is the second element, so it has index 1. You can also use negative indexing.
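A minimal sketch of such a code sample, assuming a hypothetical list x of four letters (the values are chosen here for illustration only):

```python
# Hypothetical list used only for illustration
x = ["a", "b", "c", "d"]

# "b" is the second element, so it has index 1
second = x[1]
print(second)

# Negative indexes count from the end: -3 also points at "b"
same = x[-3]
print(same)
```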

In [5]:
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Print out second element from areas
print(areas[1])

# Print out last element from areas
print(areas[-1])

# Print out the area of the living room
print(areas[5])

11.25
9.5
20.0


### Subset and calculate
After you've extracted values from a list, you can use them to perform additional calculations. Take this example, where the second and fourth element of a list x are extracted. The strings that result are pasted together using the + operator:
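A minimal sketch of that example, again assuming a hypothetical list x of letters:

```python
# Hypothetical list used only for illustration
x = ["a", "b", "c", "d"]

# The second and fourth elements have indexes 1 and 3
pasted = x[1] + x[3]

# The + operator concatenates the two strings
print(pasted)
```

The same pattern works with numbers, in which case + adds the values instead of pasting them together.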

In [6]:
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Sum of kitchen and bedroom area: eat_sleep_area
eat_sleep_area = areas[7] + areas[3]

# Print the variable eat_sleep_area
print(eat_sleep_area)

28.75


### Slicing and dicing
Selecting single values from a list is just one part of the story. It's also possible to slice your list, which means selecting multiple elements from your list. Use the following syntax:
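The general form is my_list[start:end]: the element at index start is included, the element at index end is not. Leaving out start or end slices from the beginning or up to the end of the list. A minimal sketch, assuming a hypothetical list x:

```python
# Hypothetical list used only for illustration
x = ["a", "b", "c", "d"]

# Elements at index 1 and 2 (index 3 is excluded)
middle = x[1:3]
print(middle)

# Omitting start: everything before index 2
first_two = x[:2]
print(first_two)

# Omitting end: everything from index 2 onwards
from_third = x[2:]
print(from_third)
```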

### Replace list elements
Replacing list elements is pretty easy. Simply subset the list and assign new values to the subset. You can select single elements or you can change entire list slices at once.

In [10]:
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

# Correct the bathroom area
areas[-1] = 10.50

# Change "living room" to "chill zone"
areas[4] = "chill zone"

### Extend a list
If you can change elements in a list, you sure want to be able to add elements to it, right? You can use the + operator:

In [11]:
# Create the areas list and make some changes
areas = ["hallway", 11.25, "kitchen", 18.0, "chill zone", 20.0,
         "bedroom", 10.75, "bathroom", 10.50]

# Add poolhouse data to areas, new list is areas_1
areas_1 = areas + ["poolhouse", 24.5]

# Add garage data to areas_1, new list is areas_2
areas_2 = areas_1 + ["garage", 15.45]

## Module 3. Functions and Packages
You'll learn how to use functions, methods, and packages to efficiently leverage the code that brilliant Python developers have written. The goal is to reduce the amount of code you need to solve challenging problems!

Functions are pieces of reusable code designed to make a developer's life easier.

### Familiar functions
Out of the box, Python offers a bunch of built-in functions to make your life as a data scientist easier. You already know two such functions: print() and type(). You've also used the functions str(), int(), bool() and float() to switch between data types. These are built-in functions as well.

Calling a function is easy. To get the type of 3.0 and store the output as a new variable, result, you can use the following:
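A minimal sketch of that call:

```python
# Call type() on 3.0 and store the output in result
result = type(3.0)

# result now refers to the float type
print(result)
```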

In [1]:
# Create variables var1 and var2
var1 = [1, 2, 3, 4]
var2 = True

# Print out type of var1
print(type(var1))

# Print out length of var1
print(len(var1))

# Convert var2 to an integer: out2
out2 = int(var2)
print(out2)

<class 'list'>
4
1


The len() function is extremely useful; it also works on strings to count the number of characters.
### Help!
Maybe you already know the name of a Python function, but you still have to figure out how to use it. Ironically, you have to ask for information about a function with another function: help(). In IPython specifically, you can also use ? before the function name.

To get help on the max() function, for example, you can call help(max) or, in IPython, ?max. The cell below does the same for complex():

In [2]:
help(complex)

Help on class complex in module builtins:

class complex(object)
 |  complex(real=0, imag=0)
 |  
 |  Create a complex number from a real part and an optional imaginary part.
 |  
 |  This is equivalent to (real + imag*1j) where imag defaults to 0.
 |  
 |  Methods defined here:
 |  
 |  __abs__(self, /)
 |      abs(self)
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __bool__(self, /)
 |      self != 0
 |  
 |  __divmod__(self, value, /)
 |      Return divmod(self, value).
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __float__(self, /)
 |      float(self)
 |  
 |  __floordiv__(self, value, /)
 |      Return self//value.
 |  
 |  __format__(...)
 |      complex.__format__() -> str
 |      
 |      Convert to a string according to format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getnewargs__(...)
 |  
 |  __gt__(self, value, /)
 | 

complex() takes two arguments: real and imag. As the signature complex(real=0, imag=0) shows, both have a default value of 0, so neither is required.
### Multiple arguments
Documentation sometimes marks optional arguments with square brackets, as in complex(real[, imag]), where the brackets around imag show that it is optional. But Python also uses a different way to tell users about arguments being optional.

Have a look at the documentation of sorted() by typing help(sorted) in the IPython Shell.

You'll see that sorted() takes three arguments: iterable, key and reverse.

key=None means that if you don't specify the key argument, it will be None. reverse=False means that if you don't specify the reverse argument, it will be False.

In [3]:
# Create lists first and second
first = [11.25, 18.0, 20.0]
second = [10.75, 9.50]

# Paste together first and second: full
full = first + second

# Sort full in descending order: full_sorted
full_sorted = sorted(full, reverse = True)

# Print out full_sorted
print(full_sorted)

[20.0, 18.0, 11.25, 10.75, 9.5]


### Methods
Methods are functions that belong to a specific type of object.
#### String Methods
Strings come with a bunch of methods. Follow the instructions closely to discover some of them. If you want to discover them in more detail, you can always type help(str) in the IPython Shell.

In [4]:
# string to experiment with: place
place = "poolhouse"

# Use upper() on place: place_up
place_up = place.upper()

# Print out place and place_up
print(place)
print(place_up)

# Print out the number of o's in place
print(place.count('o'))

poolhouse
POOLHOUSE
3


#### List Methods
Strings are not the only Python types that have methods associated with them. Lists, floats, integers and booleans are also types that come packaged with a bunch of useful methods. In this exercise, you'll be experimenting with:

1. **index()**, to get the index of the first element of a list that matches its input and
2. **count()**, to get the number of times an element appears in a list.

In [6]:
# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Print out the index of the element 20.0
print(areas.index(20.0))

# Print out how often 9.50 appears in areas
print(areas.count(9.50))

2
1


#### List Methods (2)
Most list methods will change the list they're called on. Examples are:

1. **append()**, that adds an element to the list it is called on,
2. **remove()**, that removes the first element of a list that matches the input, and
3. **reverse()**, that reverses the order of the elements in the list it is called on.
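The three methods listed above can be sketched as follows. The list rooms and its values are hypothetical, chosen only for illustration. Note that each call changes rooms in place rather than returning a new list.

```python
# Hypothetical list used only for illustration
rooms = ["hallway", "kitchen", "bedroom"]

# append() adds an element to the end of the list
rooms.append("garage")

# remove() deletes the first element that matches the input
rooms.remove("kitchen")

# reverse() reverses the order of the elements in place
rooms.reverse()

print(rooms)
```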

### Packages
#### Import package
As a data scientist, some notions of geometry never hurt. Let's refresh some of the basics.

For a fancy clustering algorithm, you want to find the circumference, C, and area, A, of a circle. When the radius of the circle is r, you can calculate C and A as:

$$C = 2\pi r$$

$$A = \pi r^2$$

To use the constant pi, you'll need the math package.
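A minimal sketch of both calculations using a general import of math. The radius value 0.43 is an assumption made here purely for illustration.

```python
# Import the math package
import math

# Hypothetical radius, chosen only for illustration
r = 0.43

# Circumference: C = 2 * pi * r
C = 2 * math.pi * r

# Area: A = pi * r squared
A = math.pi * r ** 2

print("Circumference: " + str(C))
print("Area: " + str(A))
```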

#### Selective import
General imports, like import math, make all functionality from the math package available to you. However, if you decide to only use a specific part of a package, you can always make your import more selective:

from math import pi

Let's say the Moon's orbit around planet Earth is a perfect circle, with a radius r (in km) that is defined in the script.

In [7]:
# Definition of radius
r = 192500

# Import radians function of math package
from math import radians

# Travel distance of Moon over 12 degrees. Store in dist.
dist = r * radians(12)

# Print out dist
print(dist)

40317.10572106901


## Module 4. Numpy
NumPy is a fundamental Python package to efficiently practice data science. Learn to work with powerful tools in the NumPy array, and get started with data exploration.
### Lists Recap
* Powerful
* Collection of values
* Hold different types
* Change, add, remove

But lists have a few limitations:
* Mathematical operations over collections
* Speed
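The first limitation can be sketched as follows. The heights and weights values are hypothetical. Dividing one list by another raises a TypeError, so a list needs an explicit loop, while NumPy arrays apply the calculation element by element.

```python
import numpy as np

# Hypothetical heights (m) and weights (kg), for illustration only
heights = [1.73, 1.68, 1.71]
weights = [65.4, 59.2, 63.6]

# Arithmetic between two plain lists is not supported
try:
    weights / heights
except TypeError as err:
    print("Lists do not support this:", err)

# With lists, the calculation has to be written element by element
bmi_list = [w / h ** 2 for w, h in zip(weights, heights)]

# With NumPy arrays, the same calculation works on the whole collection
np_bmi = np.array(weights) / np.array(heights) ** 2
print(np_bmi)
```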

#### Your First NumPy Array
In this chapter, we're going to dive into the world of baseball. Along the way, you'll get comfortable with the basics of numpy, a powerful package to do data science.

A list baseball has already been defined in the Python script, representing the height of some baseball players in centimeters. Can you add some code here and there to create a numpy array from it?

In [8]:
# Create list baseball
baseball = [180, 215, 210, 210, 188, 176, 209, 200]

# Import the numpy package as np
import numpy as np

# Create a numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out type of np_baseball
print(type(np_baseball))

<class 'numpy.ndarray'>


#### Baseball players' height
You are a huge baseball fan. You decide to call the MLB (Major League Baseball) and ask around for some more statistics on the height of the main players. They pass along data on more than a thousand players, which is stored as a regular Python list: height_in. The height is expressed in inches. Can you make a numpy array out of it and convert the units to meters?

In [52]:
# height is available as a regular list

# Import packages
import numpy as np
import pandas as pd

# import dataframe
baseball = pd.read_csv("baseball.csv")

# select heights of player
height_in = baseball["Height"].values

# Create a numpy array from height_in: np_height_in
np_height_in = np.array(height_in)

# Print out np_height_in
print(np_height_in)

# Convert np_height_in to m: np_height_m
np_height_m = np_height_in * 0.0254

# Print np_height_m
print(np_height_m)

[74 74 72 ... 75 75 73]
[1.8796 1.8796 1.8288 ... 1.905  1.905  1.8542]


#### Baseball player's BMI
The MLB also offers to let you analyze their weight data. Again, both are available as regular Python lists: height_in and weight_lb. height_in is in inches and weight_lb is in pounds.

It's now possible to calculate the BMI of each baseball player. 

In [53]:
# Create array from weight_lb with metric units: np_weight_kg
np_weight_lb = baseball['Weight'].values
np_weight_kg =  np_weight_lb * 0.453592

# Calculate the BMI: bmi
bmi = np_weight_kg / (np_height_m ** 2)


# Print out bmi
print(bmi)

[23.11037639 27.60406069 28.48080465 ... 25.62295933 23.74810865
 25.72686361]


#### Lightweight baseball players

In [54]:
# Create the light array
light = bmi < 21

# Print out light
print(light)

# Print out BMIs of all baseball players whose BMI is below 21
print(bmi[light])

[False False False ... False False False]
[20.54255679 20.54255679 20.69282047 20.69282047 20.34343189 20.34343189
 20.69282047 20.15883472 19.4984471  20.69282047 20.9205219 ]


#### Numpy Side Effects
1. Numpy is great for doing vector arithmetic. If you compare its functionality with regular Python lists, however, some things have changed. First of all, numpy arrays cannot contain elements with different types. If you try to build such an array, some of the elements' types are changed to end up with a homogeneous array. This is known as **type coercion**.

2. Second, the typical arithmetic operators, such as +, -, * and / have a different meaning for regular Python lists and numpy arrays.
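Both side effects can be sketched with a few small, hypothetical values. In the first part the float and the boolean are coerced to strings. In the second part, + concatenates lists but adds arrays element by element.

```python
import numpy as np

# Type coercion: mixed types are converted to a single type (strings here)
mixed = np.array([1.0, "is", True])
print(mixed)
print(mixed.dtype.kind)  # "U" stands for a unicode string type

# + on lists pastes the lists together
python_list = [True, 1, 2] + [3, 4, False]
print(python_list)

# + on arrays adds element-wise; True counts as 1 and False as 0
numpy_array = np.array([True, 1, 2]) + np.array([3, 4, False])
print(numpy_array)
```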

#### Subsetting NumPy Arrays
You've seen it with your own eyes: Python lists and numpy arrays sometimes behave differently. Luckily, there are still certainties in this world. For example, subsetting (using the square bracket notation on lists or arrays) works exactly the same. To see this for yourself, try the following lines of code in the IPython Shell:

In [55]:
# Print out the weight at index 50
print(np_weight_lb[50])

# Print out sub-array of np_height_in: index 100 up to and including index 110
print(np_height_in[100:111])

200
[73 74 72 73 69 72 73 75 75 73 72]


### 2D NumPy Array
#### Your First 2D NumPy Array
Before working on the actual MLB data, let's try to create a 2D numpy array from a small list of lists.

In this exercise, baseball is a list of lists. The main list contains 4 elements. Each of these elements is a list containing the height and the weight of 4 baseball players, in this order. baseball is already coded for you in the script.
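A minimal sketch of this step. The heights and weights in baseball are placeholder values assumed here for illustration, not the original exercise data.

```python
import numpy as np

# Hypothetical list of lists: [height, weight] for 4 players
baseball = [[180, 78.4],
            [215, 102.7],
            [210, 98.5],
            [188, 75.2]]

# Create a 2D numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out the type of np_baseball
print(type(np_baseball))

# Print out the shape of np_baseball: (rows, columns)
print(np_baseball.shape)
```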

#### Baseball data in 2D form
You have another look at the MLB data and realize that it makes more sense to restructure all this information in a 2D numpy array. This array should have 1015 rows, corresponding to the 1015 baseball players you have information on, and 2 columns (for height and weight).

The MLB was, again, very helpful and passed you the data in a different structure, a Python list of lists. In this list of lists, each sublist represents the height and weight of a single baseball player. The name of this embedded list is baseball.

Can you store the data as a 2D array to unlock numpy's extra functionality?

In [69]:
# Create a 2D numpy array from the heights and weights: baseball_list
baseball_list = np.array(list(zip(np_height_in,np_weight_lb)))

# Print out the shape of baseball_list
print(baseball_list.shape)

(1015, 2)


In [68]:
baseball_list

array([[ 74, 180],
       [ 74, 215],
       [ 72, 210],
       ...,
       [ 75, 205],
       [ 75, 190],
       [ 73, 195]], dtype=int64)

#### Subsetting 2D NumPy Arrays

In [63]:
# Print out the 50th row of baseball_list
print(baseball_list[49,:])

# Select the entire second column of baseball_list: np_weight_lb
np_weight_lb = baseball_list[:,1]

# Print out height of 124th player
print(baseball_list[123,0])

[ 70 195]
75


#### 2D Arithmetic

In [72]:
# Add the age column to the 2D array: baseball_list
age = baseball["Age"].values
baseball_list = np.array(list(zip(np_height_in,np_weight_lb, age)))

# Create numpy array: conversion
conversion = np.array([0.0254, 0.453592, 1])

# Print out product of baseball_list and conversion
print(baseball_list*conversion)

[[ 1.8796  81.64656 22.99   ]
 [ 1.8796  97.52228 34.69   ]
 [ 1.8288  95.25432 30.78   ]
 ...
 [ 1.905   92.98636 25.19   ]
 [ 1.905   86.18248 31.01   ]
 [ 1.8542  88.45044 27.92   ]]


### Numpy: Basic Statistics
#### Average versus median

In [75]:
# Create np_height_in from baseball_list
np_height_in = baseball_list[:,0]

# Print out the mean of np_height_in
print("Average height: ", np.mean(np_height_in))

# Print out the median of np_height_in
print("Median height: ", np.median(np_height_in))

# Print out the standard deviation of np_height_in
print("Standard deviation: ", np.std(np_height_in))

# Print out the correlation between height and weight
print("correlation between height and weight: ", np.corrcoef(np_height_in, baseball_list[:,1]))

Average height:  73.6896551724138
Median height:  74.0
Standard deviation:  2.312791881046546
correlation between height and weight:  [[1.         0.53153932]
 [0.53153932 1.        ]]


It's always a good idea to check both the median and the mean, to get an idea about the overall distribution of the entire dataset.

### Blend it all together
In the last few exercises you've learned everything there is to know about heights and weights of baseball players. Now it's time to dive into another sport: soccer.

You've contacted FIFA for some data and they handed you two lists. The lists are the following:

positions = ['GK', 'M', 'A', 'D', ...]

heights = [191, 184, 185, 180, ...]

Each element in the lists corresponds to a player. The first list, positions, contains strings representing each player's position. The possible positions are: 'GK' (goalkeeper), 'M' (midfield), 'A' (attack) and 'D' (defense). The second list, heights, contains integers representing the height of the player in cm. The first player in the lists is a goalkeeper and is pretty tall (191 cm).

You're fairly confident that the median height of goalkeepers is higher than that of other players on the soccer field. Some of your friends don't believe you, so you are determined to show them using the data you received from FIFA and your newly acquired Python skills.

In [86]:
fifa = pd.read_csv("fifa.csv")
fifa.head()

Unnamed: 0,id,name,rating,position,height,foot,rare,pace,shooting,passing,dribbling,defending,heading,diving,handling,kicking,reflexes,speed,positioning
0,1001,Gábor Király,69,GK,191,Right,0,,,,,,,70.0,66.0,63.0,74.0,35.0,66.0
1,100143,Frederik Boi,65,M,184,Right,0,61.0,65.0,63.0,59.0,62.0,62.0,,,,,,
2,100264,Tomasz Szewczuk,57,A,185,Right,0,65.0,54.0,43.0,53.0,55.0,74.0,,,,,,
3,100325,Steeve Joseph-Reinette,63,D,180,Left,0,68.0,38.0,51.0,46.0,64.0,71.0,,,,,,
4,100326,Kamel Chafni,72,M,181,Right,0,75.0,64.0,67.0,72.0,57.0,66.0,,,,,,


In [96]:
# Convert positions and heights to numpy arrays: np_positions, np_heights
# (column names and position values in this file carry a leading space)
np_positions = fifa[' position'].values
np_heights = fifa[' height'].values


# Heights of the goalkeepers: gk_heights
gk_heights = np_heights[np_positions == " GK"]

# Heights of the other players: other_heights
# Compare against the same " GK" string, otherwise no player is excluded
other_heights = np_heights[np_positions != " GK"]

# Print out the median height of goalkeepers
print("Median height of goalkeepers: " + str(np.median(gk_heights)))

# Print out the median height of other players
print("Median height of other players: " + str(np.median(other_heights)))

Median height of goalkeepers: 188.0
Median height of other players: 182.0
