# Some Python Basics

## Variables

Variables in Python are dynamically typed: a name has no fixed type and simply refers to whatever object was last assigned to it.  We typically don't care what type an object is, only what it is able to do.  This is often referred to as duck typing (if it looks like a duck and acts like a duck, it's a duck).  Variables in Python work more like removable labels.  You can attach a 'var1' label to a number, then to a string, then to a class, etc.

In the following code, notice that None just means 'no value'.  Notice also that we can do more than one assignment at once, which gives us some useful functionality such as the variable swap below.

In [3]:
var1 = None
var1 = "I can be anything"
var1 = 5
var2 = 100
var1, var2 = var2, var1
print(var1, var2)

100 5
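
The "removable label" idea can be sketched a little further. The example below is only an illustration (the values are made up): it attaches two labels to the same list, changes the list through one label, and then moves the first label to a different object.

```python
# Two labels attached to the same list object
var1 = [1, 2, 3]
var2 = var1

# Changing the list through one label is visible through the other
var2.append(4)
same_object = var1 is var2      # both names still refer to one object

# Re-attaching var1 to a string does not affect the list
var1 = "something else"

print(same_object, var1, var2)
```

The `is` operator checks whether two names refer to the same object, which is a different question from whether two objects are equal (`==`).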


To get more information on a variable's type, you can use the type() function.

In [4]:
var1 = 100
var2 = "darkness"
print(type(var1), type(var2))

<class 'int'> <class 'str'>


We can also get input from users to fill our variables.

In [5]:
var1 = input("what's your name again? ")
print("Hello " + var1)

what's your name again?  Chelsea


Hello Chelsea


## Numbers
To start with, we will talk about some basic data types (types of values, not of variables): numbers, booleans, and strings.  Numbers in Python can be integers or floating point values.  Most of the arithmetic operators work the same as in C++ (one difference is that `/` always performs true division in Python 3, even on two integers), and there are additional operators for power (`**`) and floor division (`//`).

In [6]:
var1 = 10
var2 = var1 ** 3
var3 = 34.32 // 4
print(var1, var2, var3, sep=", ")

10, 1000, 8.0
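
To make the difference between the division operators concrete, here is a small sketch with made-up values. It compares true division, floor division, and the remainder operator, and shows that floor division rounds toward negative infinity rather than toward zero.

```python
a = 17
b = 5

true_div = a / b        # true division always returns a float
floor_div = a // b      # floor division drops the fractional part
remainder = a % b       # remainder after floor division
power = a ** 2          # exponentiation
neg_floor = -17 // 5    # floors toward negative infinity, not toward zero

print(true_div, floor_div, remainder, power, neg_floor)
```

The last case is worth remembering if you are used to C++, where integer division truncates toward zero.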


To make our code easier to read, there are a few different ways we can write numbers in Python.  To make large numbers easier to read, underscores can be used in place of commas; these are ignored when the number is used.  We can also use scientific notation for floating point numbers, which multiplies the float by a specified power of ten.

In [7]:
num = 1_000_000
print(num)

num = 9.435e-4
print(num)

1000000
0.0009435


## Booleans

There isn't much to say about booleans.  As you might have noticed, True and False are capitalized.  We haven't talked about containers or strings yet, but they evaluate to False when they are empty.  You can also add booleans together, in which case False is interpreted as 0 and True as 1.

In [8]:
var1 = True
var2 = False
var3 = var1 + var2
print(var1, var2, var3)

True False 1
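
The truthiness rule for empty values can be checked directly with the built-in `bool()` function. The lists below are just sample values chosen for illustration.

```python
# Empty strings and containers, zero, and None are all treated as False
empty_values = ["", [], {}, 0, None]
all_empty_false = all(not bool(v) for v in empty_values)

# Non-empty strings and containers, and non-zero numbers, are treated as True
full_values = ["a", [0], {"k": 1}, 42]
all_full_true = all(bool(v) for v in full_values)

# Because True counts as 1, summing booleans counts the True values
total_true = sum([True, False, True])

print(all_empty_false, all_full_true, total_true)
```

Note that `[0]` is truthy: the list is not empty, even though the item inside it is falsy.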


## Strings
Strings in Python are created with ' or " and are immutable: if a change needs to be made to a string, a new string is returned.  Python 3 strings are sequences of Unicode characters (and UTF-8 is the default encoding for source files), so they work with text in different languages out of the box.  Python strings work similarly to STL strings since they are classes with support functions built in; however, in Python the amount of built-in functionality is much larger.

In [9]:
var1 = 'hello ' + "world, " + "Python"
var1 = var1.upper()

print(var1)

HELLO WORLD, PYTHON
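
Immutability is easy to see in a short experiment. In this sketch (variable names are illustrative), calling a string method leaves the original untouched, and trying to change a character in place raises a `TypeError`.

```python
greeting = "hello"

# String methods return a new string; the original is unchanged
shouted = greeting.upper()

# Assigning to a single character is not allowed
error_message = None
try:
    greeting[0] = "H"
except TypeError as err:
    error_message = str(err)

print(greeting, shouted)
print(error_message)
```

This is why the earlier cell reassigns the result (`var1 = var1.upper()`) instead of calling the method on its own.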


## List
In C++, choosing which container to use is actually very important (list, queue, stack, vector, array?).  In Python, this choice is simplified: the list is a single general-purpose container that covers most of these uses.  To create a list, use square brackets [].  Notice that the types of the items don't have to match.  You can also create a list by calling the list() function.

In [10]:
var1 = [1, 2, 3, 4, 5.0, 6.0, True, False]

var1.append(123)
var1.pop(0)
print(var1)

[2, 3, 4, 5.0, 6.0, True, False, 123]
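
As a rough sketch of how one list can stand in for several C++ containers, the example below builds lists with `list()` and then uses `append()` and `pop()` in a stack-like and a queue-like way. The values are arbitrary.

```python
# list() builds a list from any iterable
letters = list("abc")
numbers = list(range(5))

# Stack-like use: add and remove at the end
numbers.append(5)
last = numbers.pop()

# Queue-like use: remove from the front
first = numbers.pop(0)

print(letters, numbers, last, first)
```

Removing from the front of a long list is slow compared with removing from the end, so for heavy queue use the standard library's `collections.deque` is usually a better fit.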


## Dictionary
The list is the container to use if you are storing single items.  If you need to store key/value pairs, the container to use is the dictionary (a map or hash table).  The dictionary is created with curly braces {} and, like the list, can hold values of varying types.  If you only need to store the keys, you can omit the values and you will create a set.  Just like with the list, you can also create a dictionary or set by calling the dict() or set() functions.

In [11]:
class_grades = {"joe":100, "mike":90, "stan":80}

class_grades["sarah"] = 95
print(class_grades['sarah'])
print(class_grades['mike'])

95
90
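
Here is a small sketch of `dict()` and `set()` in use, with made-up names and grades. It also shows one detail that is easy to trip over: empty curly braces create a dictionary, not a set.

```python
# dict() with keyword arguments builds string keys
grades = dict(joe=100, mike=90)
grades["stan"] = 80

# .get() returns a default instead of raising an error for a missing key
sarah_grade = grades.get("sarah", 0)

# A set keeps only unique items; building one from a dict keeps the keys
names = set(grades)
unique_numbers = {1, 2, 2, 3}

# {} is an empty dictionary; an empty set must be written set()
empty_dict = {}
empty_set = set()

print(grades, sarah_grade, sorted(names), unique_numbers)
```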


# Pandas
In the last lesson, we got to see Pandas in action by using it to make some visualizations of the Titanic data.  Let's take some time to explore some of the cool features of Python and Pandas.

## The History of Pandas

Origins:

* 2008: The Pandas project was started by Wes McKinney when he was working at AQR Capital Management. The main motivation was to have a flexible tool to perform quantitative analysis on financial data. The name "pandas" is derived from the term "panel data," a common term for data that involves observations over time.

Early Development:

* 2009: Wes McKinney released the first public version of pandas. The initial versions laid the foundation with data structures like Series and DataFrame, which have since become staples for data manipulation in Python.

Increasing Adoption:

* 2010s: As data science and Python grew in popularity during the 2010s, so did pandas. It quickly became one of the cornerstones of the scientific stack in Python alongside libraries like NumPy, SciPy, and Matplotlib.
The library received significant contributions from many developers worldwide, enhancing its capabilities and making it more robust.

Books and Documentation:

* 2012: Wes McKinney published "Python for Data Analysis," which prominently features pandas and its application in data analysis. This book played a crucial role in introducing many individuals to pandas and data analysis in Python.


Pandas is often seen as a gateway to data science in Python. Its simple yet powerful interface makes it a favorite for beginners and professionals alike.
With the rise of big data tools like Apache Spark, Dask, and Vaex, pandas also integrates with these tools, allowing users to scale their analyses when necessary.

## DataFrames and Series

The DataFrame is the primary structure we will be using for this class.  It is an associative, two dimensional data structure.  Imagine a spreadsheet page, a SQL table, or a flat file.  The Series object is a one dimensional data structure that represents a single column of data.

We can manually create a DataFrame from dictionaries, lists, series, and much else.  We can also add new features to a DataFrame, or even combine multiple DataFrames.  If our data is provided to us we can read or write to a variety of different formats: CSV, Excel, SQL, JSON, URL, clipboard, etc.

A Series can be thought of as a single column of a DataFrame.
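
To make the relationship between the two structures concrete, here is a minimal sketch that builds a DataFrame from a dictionary of lists and pulls one column out as a Series. The column names and values are invented for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [31, 45, 27],
})

# Selecting one column returns a Series
ages = toy["age"]

print(type(toy).__name__, type(ages).__name__)
print(toy.shape)
```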

## Loading and Information

To load a file into a DataFrame, we call the appropriate read function.  We can also easily export the DataFrame back out to a file with the to_csv function.  Let's read in and look at the first 5 observations in the Titanic dataset again.  Don't forget to import pandas first!


In [12]:
import pandas as pd

df = pd.read_csv('assets/titanic_passengers.csv')
df.head()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
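
The read and write functions mirror each other. As a sketch that does not depend on any file on disk, the example below writes a tiny made-up DataFrame to an in-memory text buffer with `to_csv` and reads it back with `read_csv`. With a real file you would pass a path instead of the buffer.

```python
import io

import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Write the DataFrame as CSV text; index=False leaves out the row labels
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Rewind the buffer and read the same text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(csv_text)
print(restored)
```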


A helpful feature for getting to know a DataFrame is `.info()`.  `.info()` tells us the name of each column, the number of non-null (not missing) observations, and the type of data (integer, floating point, or object).

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


We can print a list of the columns of the DataFrame using `.columns`.

In [14]:
df.columns

for col in df.columns:
    print(col)

PassengerId
Survived
Pclass
Name
Sex
Age
SibSp
Parch
Ticket
Fare
Cabin
Embarked


We can see the number of rows and columns in the DataFrame using `.shape`.  The number of rows is shown first followed by the number of columns.

In [15]:
df.shape

(891, 12)

To get information about the distribution of quantitative features, you can use `.describe()`.  Notice that, by default, information is given only for numeric features.

In [16]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


If you want to display the distribution of the categorical features in your data, you can add `include='object'`.

In [17]:
df.describe(include='object')

| | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Braund, Mr. Owen Harris | male | 347082 | B96 B98 | S |
| freq | 1 | 577 | 7 | 4 | 644 |


Another helpful feature for exploring your data chains together two Pandas methods, `.isnull()` and `.sum()`.  This will calculate the number of missing values for each feature in the DataFrame.  **Which feature has the largest amount of missing data?**

In [42]:
df.isnull().sum()

PassengerId        0
Survived           0
Pclass             0
Name               0
Sex                0
Age              177
SibSp              0
Parch              0
Ticket             0
Fare               0
Cabin            687
Embarked           2
dtype: int64
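
The chaining works because `.isnull()` returns a table of True/False values and `.sum()` counts each True as 1. The sketch below shows this on a small made-up DataFrame, and also uses `.mean()` to get the share of missing values instead of the count.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": [None, "C85", None, None],
    "fare": [7.25, 71.28, 8.05, 53.1],
})

# Count of missing values per column, largest first
missing = toy.isnull().sum().sort_values(ascending=False)

# Share of missing values per column (the mean of True/False values)
share = toy.isnull().mean()

print(missing)
print(share)
```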

## Working with specific features

We can access specific features of a DataFrame by placing them in brackets after the name of the DataFrame.

In [43]:
df['Survived'].head(25)

0     0
1     1
2     1
3     1
4     0
5     0
6     0
7     0
8     1
9     1
10    1
11    1
12    0
13    0
14    0
15    1
16    0
17    1
18    0
19    1
20    0
21    1
22    1
23    1
24    0
Name: Survived, dtype: int64

We can access multiple features at the same time by placing the names of those features in a list (specified using `[]`) and putting the list of features in brackets after the name of the DataFrame.

In [21]:
df[['PassengerId', 'Survived', 'Pclass']].head()

| | PassengerId | Survived | Pclass |
|---|---|---|---|
| 0 | 1 | 0 | 3 |
| 1 | 2 | 1 | 1 |
| 2 | 3 | 1 | 3 |
| 3 | 4 | 1 | 1 |
| 4 | 5 | 0 | 3 |
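
The number of brackets changes what you get back. In this sketch on a made-up DataFrame, a single name in brackets returns a Series, while a list of names (even a list of one) returns a DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

one_series = toy["a"]        # single name: a Series
one_frame = toy[["a"]]       # list with one name: a one-column DataFrame
two_frame = toy[["a", "c"]]  # list with two names: a two-column DataFrame

print(type(one_series).__name__, type(one_frame).__name__)
print(two_frame.columns.tolist())
```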


## Feature Engineering

Feature engineering is the process of modifying our dataset to get it in a format that is most useful for data analysis.  Sometimes that means reformatting features or changing the units of a feature or changing the way data is represented.

Pandas makes feature engineering very easy because you can do arithmetic on an entire DataFrame or Series at once.  For example, the fare on the Titanic is in British pounds.  Let's say we want to create a new feature in the DataFrame that is the fare in US dollars.  Suppose the exchange rate is 1 GBP = 1.25 USD.  To convert any value from pounds to dollars, we multiply by 1.25.  We can do this to the entire `Fare` column and save the result as a new column using `=`.


In [22]:
#Converting fares from GBP to USD

df['Fare in USD'] = df['Fare'] * 1.25

df[['Fare', 'Fare in USD']].head()

| | Fare | Fare in USD |
|---|---|---|
| 0 | 7.25 | 9.0625 |
| 1 | 71.2833 | 89.104125 |
| 2 | 7.925 | 9.90625 |
| 3 | 53.1 | 66.375 |
| 4 | 8.05 | 10.0625 |


Let's look at another example. The feature `Survived` is coded as 0 or 1.  If we were to model the Titanic data, we would need all of our features represented using numbers.  However, when we make visualizations, it might be helpful to have another variable - let's call it `Survived_char` - where `Survived` is coded as `Survived` or `Perished`.

It's very common to need to recode data in this way, so let's take the time to look at the steps in detail.
1. Establish the condition(s) you want to use to recode your feature.
2. Recode the feature using `.loc[]`.

### Working with conditionals and `.loc[]`

1. Establish the condition(s) you want to use to recode your feature.

You can use any comparison operator such as >, <, ==, !=, >=, <=, etc. as well as some special Pandas methods such as `.str.contains()`.

In [23]:
#Here we set up two conditions, one where Survived = 1 and one where Survived = 0.  We will create a new feature called `Survived_char` based on these two conditions.

cond_survived = df['Survived'] == 1
cond_perished = df['Survived'] == 0


2. Recode the feature using `.loc[]`.

You'll do this using the format: `df.loc[condition, new feature] = new value`.  Note that `.loc` is used with square brackets, not parentheses.

In [24]:
#Coding "Survived_char" to be "Survived" and "Perished"

df.loc[cond_survived, 'Survived_char'] = "Survived"
df.loc[cond_perished, 'Survived_char'] = "Perished"

In [25]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Fare in USD | Survived_char |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 9.0625 | Perished |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 89.104125 | Survived |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 9.90625 | Survived |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 66.375 | Survived |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 10.0625 | Perished |
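
The two recoding steps can be tried on a tiny made-up DataFrame, which makes it easy to check every row by eye. The sketch also shows `.map()` with a dictionary as one alternative way to do the same recode; the column name `Survived_map` is hypothetical and used only for comparison.

```python
import pandas as pd

toy = pd.DataFrame({"Survived": [0, 1, 1, 0]})

# Step 1: set up the condition (~ means "not", so one condition covers both cases)
cond_survived = toy["Survived"] == 1

# Step 2: recode with .loc[condition, new feature] = new value
toy.loc[cond_survived, "Survived_char"] = "Survived"
toy.loc[~cond_survived, "Survived_char"] = "Perished"

# Alternative: look each value up in a dictionary
toy["Survived_map"] = toy["Survived"].map({0: "Perished", 1: "Survived"})

print(toy)
```

Both approaches give the same labels here. The `.loc[]` version is more flexible when the conditions involve ranges or more than one feature.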


## Checking our work using `.value_counts()` and `pd.crosstab()`

We can use two more helpful Pandas tools, `.value_counts()` and `pd.crosstab()`, to check our work.  They also come in handy when displaying the distribution of one or two categorical features.

`.value_counts()` shows each unique value of a feature and the number of times it occurs.

In [26]:
#Frequency table of "Survived"

df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

We can also create a frequency table for more than one feature by placing them in a list inside the square brackets next to the name of the DataFrame.

In [27]:
df[['Survived', 'Survived_char']].value_counts()

Survived  Survived_char
0         Perished         549
1         Survived         342
dtype: int64

Now let's make sure that each time `Survived = 0`, `Survived_char = 'Perished'` and each time `Survived = 1`, `Survived_char = 'Survived'`.  We can do this by making a cross-tabulation of the two features.  A cross-tabulation or cross-tab shows how individual records fall into the categories of two features at the same time.  We'll explore this idea more in the next lesson.  For now, we just want to make sure that we coded `Survived_char` correctly.

We use `pd.crosstab()` and enter the two features whose joint frequencies we want to see, separated by a comma.

In [28]:
pd.crosstab(df['Survived'],df['Survived_char'])

| Survived (rows) / Survived_char (columns) | Perished | Survived |
|---|---|---|
| 0 | 549 | 0 |
| 1 | 0 | 342 |


**Does this seem correct to you?**
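
To see how a cross-tab counts records, here is a sketch on five made-up passengers. It adds `margins=True`, an optional `pd.crosstab()` argument that appends row and column totals labelled `All`.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["f", "m", "f", "m", "m"],
    "survived": [1, 0, 1, 1, 0],
})

# Rows come from the first feature, columns from the second
table = pd.crosstab(toy["sex"], toy["survived"], margins=True)

print(table)
```

Each cell is the number of rows with that combination of values, so the `All` cell in the corner equals the number of rows in the DataFrame.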

## Sorting, Grouping and Subsets

Here are three more Pandas methods that come in very handy when doing feature engineering.

### Sorting

We can sort by a specific feature or features using `.sort_values()`.  This returns a new, sorted DataFrame and leaves the original unchanged.

In [45]:
#Age sorted youngest to oldest

sorted_df = df.sort_values(by='Age')
sorted_df.head(10)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Fare in USD | Survived_char |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | male | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | C | 10.645875 | Survived |
| 755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | male | 0.67 | 1 | 1 | 250649 | 14.5 | NaN | S | 18.125 | Survived |
| 644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C | 24.072875 | Survived |
| 469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C | 24.072875 | Survived |
| 78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | male | 0.83 | 0 | 2 | 248738 | 29.0 | NaN | S | 36.25 | Survived |
| 831 | 832 | 1 | 2 | Richards, Master. George Sibley | male | 0.83 | 1 | 1 | 29106 | 18.75 | NaN | S | 23.4375 | Survived |
| 305 | 306 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 189.4375 | Survived |
| 827 | 828 | 1 | 2 | Mallet, Master. Andre | male | 1.0 | 0 | 2 | S.C./PARIS 2079 | 37.0042 | NaN | C | 46.25525 | Survived |
| 381 | 382 | 1 | 3 | Nakid, Miss. Maria ("Mary") | female | 1.0 | 0 | 2 | 2653 | 15.7417 | NaN | C | 19.677125 | Survived |
| 164 | 165 | 0 | 3 | Panula, Master. Eino Viljami | male | 1.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | 49.609375 | Perished |


In [30]:
#Age sorted oldest to youngest

sorted_df = df.sort_values(by='Age', ascending=False)
sorted_df.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Fare in USD | Survived_char |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 630 | 631 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.0 | 0 | 0 | 27042 | 30.0 | A23 | S | 37.5 | Survived |
| 851 | 852 | 0 | 3 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.775 | NaN | S | 9.71875 | Perished |
| 493 | 494 | 0 | 1 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | NaN | C | 61.88025 | Perished |
| 96 | 97 | 0 | 1 | Goldschmidt, Mr. George B | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C | 43.31775 | Perished |
| 116 | 117 | 0 | 3 | Connors, Mr. Patrick | male | 70.5 | 0 | 0 | 370369 | 7.75 | NaN | Q | 9.6875 | Perished |
| 672 | 673 | 0 | 2 | Mitchell, Mr. Henry Michael | male | 70.0 | 0 | 0 | C.A. 24580 | 10.5 | NaN | S | 13.125 | Perished |
| 745 | 746 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0 | B22 | S | 88.75 | Perished |
| 33 | 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5 | NaN | S | 13.125 | Perished |
| 54 | 55 | 0 | 1 | Ostby, Mr. Engelhart Cornelius | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C | 77.474 | Perished |
| 280 | 281 | 0 | 3 | Duane, Mr. Frank | male | 65.0 | 0 | 0 | 336439 | 7.75 | NaN | Q | 9.6875 | Perished |


In [31]:
#Data sorted by age and sex

sorted_df = df.sort_values(by=['Age', 'Sex'])
sorted_df.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Fare in USD | Survived_char |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | male | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | C | 10.645875 | Survived |
| 755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | male | 0.67 | 1 | 1 | 250649 | 14.5 | NaN | S | 18.125 | Survived |
| 469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C | 24.072875 | Survived |
| 644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | female | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | C | 24.072875 | Survived |
| 78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | male | 0.83 | 0 | 2 | 248738 | 29.0 | NaN | S | 36.25 | Survived |
| 831 | 832 | 1 | 2 | Richards, Master. George Sibley | male | 0.83 | 1 | 1 | 29106 | 18.75 | NaN | S | 23.4375 | Survived |
| 305 | 306 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 189.4375 | Survived |
| 172 | 173 | 1 | 3 | Johnson, Miss. Eleanor Ileen | female | 1.0 | 1 | 1 | 347742 | 11.1333 | NaN | S | 13.916625 | Survived |
| 381 | 382 | 1 | 3 | Nakid, Miss. Maria ("Mary") | female | 1.0 | 0 | 2 | 2653 | 15.7417 | NaN | C | 19.677125 | Survived |
| 164 | 165 | 0 | 3 | Panula, Master. Eino Viljami | male | 1.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | 49.609375 | Perished |
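
When sorting by several features, each one can have its own direction. The sketch below, on made-up data, sorts by class in ascending order and by age in descending order within each class, by passing a list to `ascending`.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "pclass": [3, 1, 1, 3],
    "age": [30.0, 40.0, 35.0, 22.0],
})

# One True/False per feature in `by`: pclass ascending, age descending
ordered = toy.sort_values(by=["pclass", "age"], ascending=[True, False])

print(ordered)
print(toy["name"].tolist())  # the original DataFrame keeps its order
```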


### Grouping

We can use `.groupby()` to make calculations for specific groups separately.  For example, if we wanted to calculate the mean age for male and female passengers separately, we can use both the `.groupby()` and the `.mean()` methods.

In [32]:
#Look closely at the () and [] being used below

df.groupby('Sex')['Age'].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

**On average, which passengers were (slightly) older?  Male passengers or female passengers?**

We can group by more than one feature.

In [33]:
#Calculating mean age by sex and passenger class

df.groupby(['Sex', 'Pclass'])['Age'].mean()

Sex     Pclass
female  1         34.611765
        2         28.722973
        3         21.750000
male    1         41.281386
        2         30.740707
        3         26.507589
Name: Age, dtype: float64

**Which sex/passenger class group was the oldest on average?  Which was the youngest on average?**
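
If you want several summaries per group at once, one option is `.agg()` with a list of function names. This is a sketch on made-up data, including one missing age to show that missing values are left out of the calculations.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["f", "f", "m", "m", "m"],
    "age": [20.0, 30.0, 40.0, 50.0, None],
})

# One row per group, one column per summary statistic
summary = toy.groupby("sex")["age"].agg(["count", "mean", "min", "max"])

print(summary)
```

The `count` for the male group is 2, not 3, because the missing age is not counted.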

## One more example of feature engineering

Let's create a categorical feature called `Fare_cat` that categorizes passengers as paying a `low`, `medium` or `high` fare.  If a passenger paid less than 8 pounds, they paid a low fare.  If they paid between 8 and 31 pounds, it was a medium fare, and if they paid more than 31 pounds, it was a high fare.

In [47]:
#Note that there are other, fancier ways to do this, but IMO this is the most straightforward way when you are just getting started

#Set up conditions
cond_low = df['Fare'] < 8
cond_med = (df['Fare'] >= 8) & (df['Fare'] <= 31) #Note that we are combining two conditions, each in parentheses, using an & for "and"
cond_high = df['Fare'] > 31

#Code the new feature
df.loc[cond_low, 'Fare_cat'] = 'low'
df.loc[cond_med, 'Fare_cat'] = 'medium'
df.loc[cond_high, 'Fare_cat'] = 'high'

Now we can check our work using `.groupby()` and the `.min()` and `.max()` methods.

In [48]:
#Look at the minimum fare in each category
df.groupby('Fare_cat')['Fare'].min()

Fare_cat
high      31.2750
low        0.0000
medium     8.0292
Name: Fare, dtype: float64

In [49]:
#Look at the maximum fare in each category
df.groupby('Fare_cat')['Fare'].max()

Fare_cat
high      512.3292
low         7.9250
medium     31.0000
Name: Fare, dtype: float64
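
As one example of a different way to build the same feature, the sketch below uses NumPy's `np.select`, which takes a list of conditions and a matching list of labels. This is an alternative technique, not the approach used above; it is shown on a few made-up fares chosen to sit on and around the cut-offs, and the `"unknown"` default is an assumption for rows that match no condition.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Fare": [0.0, 7.925, 8.0, 31.0, 31.275, 512.3292]})

# Same cut-offs as above; .between() includes both end points by default
conditions = [
    toy["Fare"] < 8,
    toy["Fare"].between(8, 31),
    toy["Fare"] > 31,
]
labels = ["low", "medium", "high"]

# Each row gets the label of the first condition it satisfies
toy["Fare_cat"] = np.select(conditions, labels, default="unknown")

print(toy)
```

Testing values that sit exactly on a boundary (8 and 31 here) is a quick way to confirm that the categories do not overlap or leave gaps.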

### Subsets

Finally, we can subset our data with a condition using `.loc[]`, similar to the way we performed our feature engineering earlier.  Let's select only female passengers.

In [55]:
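#Set the condition
cond_female = df['Sex'] == 'female'

#Subset the data
female_passengers = df.loc[cond_female]
female_passengers.head(10)

#Another subset to try: passengers who paid a fare of 200 pounds or more
#cond_expensive = df['Fare'] >= 200
#expensive_tickets = df.loc[cond_expensive]
#expensive_tickets

#A condition can also be written directly inside .loc[]
#Only the last expression in a cell is displayed, so the table below shows these cabin passengers
cabin_passengers = df.loc[df['Cabin'] == 'B51 B53 B55']
cabin_passengers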


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Fare in USD | Survived_char | Fare_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C | 640.4115 | Survived | high |
| 872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0 | B51 B53 B55 | S | 6.25 | Perished | low |


In [38]:
female_passengers['Sex'].value_counts()

female    314
Name: Sex, dtype: int64
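
Conditions can be combined when subsetting, and `.loc[]` can pick columns at the same time. The sketch below uses a made-up DataFrame: `&` means "and", `|` means "or", and each individual condition needs its own parentheses.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Fare": [250.0, 300.0, 10.0, 8.0],
    "Pclass": [1, 1, 3, 3],
})

# Rows where both conditions hold, keeping only two columns
cond_both = (toy["Sex"] == "female") & (toy["Fare"] >= 200)
subset = toy.loc[cond_both, ["Sex", "Fare"]]

# Rows where at least one condition holds
either = toy.loc[(toy["Pclass"] == 1) | (toy["Fare"] < 9)]

print(subset)
print(either)
```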

# OK
With these tools (Pandas Series and DataFrames), we should be able to work effectively with very large data sets.  This is an important part of our workload whenever we do anything related to data science, whether we are doing statistical analysis, creating visualizations to gain insight into data, or (eventually) creating machine learning models to make predictions.