<a href="https://colab.research.google.com/github/kayley-smiley/Python-Training/blob/main/Python_for_Beginners_Part_1_(Instructor_Version).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Python for Beginners: Part 1**

---------------------------------------------------------------------------

This training will introduce you to coding in Python and teach you some fundamental concepts for using Python for data analysis.

Specifically, we'll cover:

* data types and structures
* variables
* functions
* the pandas library
* importing a data set
* conducting exploratory data analysis


We will use Google Colab to execute the code for this training because it doesn't require any installation or setup before getting started. However, when using Python to analyze a data set containing confidential information, it is highly recommended that you do not use Google Colab because there is no guaranteed data privacy.


### **What is Python?**

* Python is a popular programming language in the data science and machine learning communities.
* Python can handle large amounts of data and be used to perform complex analysis.
* It is considered one of the more beginner-friendly programming languages.
* Python is "open source", which means that its source code is made available for use and modification.
  * Using Python is free!
* It has a large community of developers, which means it's relatively easy to find tutorials and get answers to questions.
* Python's capabilities can be extended even further with its extensive collection of libraries, which provide pre-written code.

### **Getting Started**

To conduct data analysis with Python, we will use what's called a "notebook." Notebooks contain a mixture of code and text, which makes it fairly easy to present your work in an organized way, reproduce your work on other data sets, and share your work with others. You can add a block of code or text in two ways: by clicking + Code or + Text in the bar above or by hovering your mouse over the bottom of a current block and then clicking + Code or + Text. To create a new notebook, click File → New Notebook in the bar above.



### **Running Code**

When using Python, we can execute or run our code as soon as we write it, which makes it easy to test out ideas. To execute a code block, click the "play" button located on the left side of the code block. The output for the code will be displayed below the block.

In [None]:
1 + 1

2

In [None]:
10 - 2

8

The table below provides the symbols for some common operations.


| Symbol | Operation |
| :--------- | :----- |
| `+` | Addition |
| `-` | Subtraction |
| `*` | Multiplication |
| `/` | Division |
| `**` | Exponentiation |

**Practice**: Calculate $2^3$ in the code block below.

In [None]:
2**3

8

You can type multiple lines of code in the same code block, but output will only display for the last line of code. If you want to print multiple outputs, you can either use separate code blocks or you can use the built-in `print()` function.

It is common practice to use separate code blocks for separate topics.

In [None]:
#only the output for the last line prints
2**3
5+10

15

In [None]:
#printing both outputs
print(2**3)
print(5+10)

8
15
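
The operators from the table above can also be combined in a single expression. As a small sketch with arbitrary numbers, Python applies exponentiation first, then multiplication and division, then addition and subtraction, and parentheses override that order.

```python
# multiplication happens before addition
print(2 + 3 * 4)     # 14

# parentheses change the order
print((2 + 3) * 4)   # 20

# exponentiation happens before multiplication
print(2 * 3 ** 2)    # 18

# division returns a decimal number, even when the result is whole
print(8 / 2)         # 4.0
```

When in doubt, adding parentheses makes the intended order explicit.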


### **Comments**

Comments are short notes that you can place in code blocks. They are part of the code, but Python ignores them when the code is run. They are intended to provide clarity for people reading the code or prevent execution when testing code.

Comments start with the `#` symbol and everything typed to the right of the `#` symbol will be ignored. You can confirm what is included in the comment because the text will be green.

In [None]:
#this is a comment

In the code below, Python adds `5 + 2` and ignores everything to the right of `#`.



In [None]:
5 + 2 #adding 5 + 2

7

In [None]:
#the code below will run because it's on a new line
5 + 8

13

In [None]:
#we can also create comments that
#are multiple lines

### **Variables**

Variables are essentially containers for storing data. You can also think of them as names that you assign to objects. Variables make it easy to store and reference data in our code.

To create a variable, we use the `=` sign.


In [None]:
x = 3

After you've defined a variable, you can see what it equals by typing the variable name or using the `print()` function.

In [None]:
x

3

In [None]:
print(x)

3


There are several rules when it comes to naming variables:


* A variable name **cannot** contain spaces.
* The first character in the name must be either a letter or an underscore.
  * After the first character, you can use numbers too.
*  Uppercase and lowercase letters are distinct.
  * `a` and `A` are different variables.
* A variable name **cannot** be one of Python's keywords, such as `for`, `and`, `or`, `else`, and `in`.
* It is recommended that you choose names that are somewhat informative and short.

In Colab, clicking on the $\{x\}$ symbol to the left will provide a list of variables currently in use.

It is important to note that you can save over existing variables! We defined `x` as $3$ earlier, but if we run the code below, `x` is now $10$.



In [None]:
x = 10
x

10

**Practice**: Create two variables, `x` and `y`, equal to 5 and 10, respectively. Then, create a new variable `z` that equals the sum of `x` and `y`, and print the value of `z`.

In [None]:
x = 5
y = 10
z = x + y

print(z)

15


### **Data Types**

Variables can store different types of data, which have unique purposes. The table below contains some frequently used data types. There are more data types, such as complex numbers, but we won't discuss these.

| Type | Description | Example |
| :--------- |:----- |:----|
| str | text/characters  | `'hello'` |
| int | integer/whole number | `1` |
| float | decimals | `1.2` |
| bool | logical value | `True`, `False` |
| list | ordered collection of data | `['apple', 'banana', 'cherry']` |
| set | unordered collection of data | `{'apple', 'banana', 'cherry'}` |


To check the data type for a variable, you can use the built-in `type()` function. Let's check the type of the variable, `x`, we defined earlier.

In [None]:
type(x)

int

In [None]:
type('hello')

str

In [None]:
type([1, 2, 3])

list

**Practice**: In the code block below, find the data type for `31.8`.

In [None]:
type(31.8)

float

**Practice**: In the code block below, find the data type for `{2, 4, 6}`.

In [None]:
type({2, 4, 6})

set

#### **Booleans**

Booleans (bool) represent one of two values: `True` or `False`.

What would we use the Boolean data type for? In programming, we often want to know if a statement is `True` or `False`.  

In [None]:
10 > 9

True

In [None]:
type(10 > 9)

bool

However, if we wanted to check for equality, we can't use `=` since that's how we assign variables. To check for equality, Python uses `==`. The table below provides the symbols for logical operators.

| Operator | Symbol |
| :--------- | :----- |
| equal to | `==`  |
| not equal to | `!=` |
| greater than | `>` |  
| less than | `<`  |
| greater than or equal to | `>=` |
| less than or equal to | `<=` |

In [None]:
1 == 1

True

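We can also use logical operators along with boolean expressions. The `&` and `|` operators allow us to connect multiple boolean expressions to create a compound expression. Each sub-expression should be wrapped in parentheses, because `&` and `|` are evaluated before comparisons such as `>` and `==`.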

* The `&` operator returns `True` if *all* sub-expressions are `True`. Otherwise, it will return `False`.
* The `|` operator returns `True` if *at least one* sub-expression is `True`.  


In [None]:
#checks if both expressions are true
(2 > 1) & (1 > 3)

False

In [None]:
#checks if at least one of the expressions is true
(2 < 1) | (1 < 2)

True

It is worth knowing that `True` and `False` are equivalent to `1` and `0`. This means that we can add Booleans. That might not seem very useful, but it will come in handy later on.

In [None]:
#this is equivalent to 1 + 0
True + False

1
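
As a small sketch of why adding Booleans is useful, the code below counts how many comparisons are true. The numbers are arbitrary and chosen only for illustration.

```python
# each comparison is True (1) or False (0), so the sum counts the True values
count = (5 > 10) + (12 > 10) + (30 > 10) + (8 > 10)

print(count)  # 2
```

Two of the four numbers are greater than 10, so the sum is 2.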

**Practice**: In the code block below, create variables `x` and `y` that equal 1 and 1.0001, respectively. Then, check to see if `x` is greater than or equal to `y`.

In [None]:
x = 1
y = 1.0001
x >= y

False

**Practice**: Create another variable `z` that equals 2. Write a statement that checks if either `x` is greater than `z` or `z` is less than `y`.

In [None]:
z = 2
(x > z) | (z < y)

False

#### **Lists and Sets**

Lists and sets look very similar, so what are the differences between the two?

First, lists are **ordered** and sets are not. For example, the set `{1, 2, 3}` is equal to the set `{2, 3, 1}` because they contain the same elements. However, the list `[1, 2, 3]` is not equal to the list `[2, 3, 1]` because even though they contain the same elements, they are not in the same order.

Just as we used logical operators to check if single elements are equal, we can also check if sets or lists are equal.









In [None]:
{1, 2, 3} == {2, 3, 1}

True

In [None]:
[1, 2, 3] == [2, 3, 1]

False

Because lists are ordered, they are also **indexed**. This means we can access elements within a list by referencing their index.

It's important to note here that Python begins indexing with 0. So, the "first" item in a list has index 0, not 1. To access an element by its index, we put the index within brackets, `[]`, after the list name.

In [None]:
#define a list that contains cities in Colorado
cities = ['Denver', 'Lakewood', 'Centennial', 'Littleton', 'Aurora']

#access the first element in the list
cities[0]

'Denver'

We can also use negative indices, which will start at the end of the list instead of the beginning. This can be a convenient way to see the last item in a list if you're not sure how long the list is.

In [None]:
#access the last element in the list
cities[-1]

'Aurora'

We can access more than one element at a time by giving a range of indices, such as `[0:2]`. However, in Python, the first number given in the range is inclusive and the last one is not. So the range `[0:2]` provides the elements at indices 0 and 1.

In [None]:
#access the first two elements in the list
cities[0:2]

['Denver', 'Lakewood']

If we leave off the starting index for a range, the range will start with the first item in the list by default. Similarly, if we leave off the ending index for a range, the range will end with the last item in the list by default.

In [None]:
#access all elements with an index of two or larger
cities[2:]

['Centennial', 'Littleton', 'Aurora']
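
The sketch below gathers a few more slicing patterns in one place. It repeats the `cities` definition so that it can run on its own; the variable names `first_three`, `middle`, and `last_two` are only for illustration.

```python
cities = ['Denver', 'Lakewood', 'Centennial', 'Littleton', 'Aurora']

# leave off the starting index: indices 0, 1, and 2
first_three = cities[:3]
print(first_three)   # ['Denver', 'Lakewood', 'Centennial']

# give both indices: indices 1, 2, and 3 (4 is excluded)
middle = cities[1:4]
print(middle)        # ['Lakewood', 'Centennial', 'Littleton']

# negative starting index: the last two items
last_two = cities[-2:]
print(last_two)      # ['Littleton', 'Aurora']
```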

Because the items in a set are unordered, they don't have indices, which means we can't refer to them by an index.

Lists and sets are both **mutable** collections of elements. An object is considered mutable if its data or attributes can be altered after it's created.

We can alter a set by adding or removing items with the `add()` and `remove()` functions. However, the elements in a set are themselves unchangeable (aside from removing them).







In [None]:
#define a set of states
states = {'California', 'Washington', 'Oregon', 'Nevada', 'Utah'}

#add 'Arizona' to the set
states.add('Arizona')

#view what elements are in 'states'
states

{'Arizona', 'California', 'Nevada', 'Oregon', 'Utah', 'Washington'}

In [None]:
#remove 'Washington' from the set
states.remove('Washington')

#view what elements are in 'states'
states

{'Arizona', 'California', 'Nevada', 'Oregon', 'Utah'}

We can alter a list by adding elements using the `append()` or `insert()` functions. The `append()` function will add an element to the end of the list and the `insert()` function will insert an element at a specified index.

In [None]:
#add 'Boulder' to the end of the cities list
cities.append('Boulder')

cities

['Denver', 'Lakewood', 'Centennial', 'Littleton', 'Aurora', 'Boulder']

In [None]:
#make 'Golden' the third element in the cities list
cities.insert(2, 'Golden')

cities

['Denver',
 'Lakewood',
 'Golden',
 'Centennial',
 'Littleton',
 'Aurora',
 'Boulder']

We can also alter a list by changing its elements.

In [None]:
#replace 'Littleton', the fifth element (index 4), with 'Denver'
cities[4] = 'Denver'

cities

['Denver', 'Lakewood', 'Golden', 'Centennial', 'Denver', 'Aurora', 'Boulder']

Notice the `cities` list contains two `'Denver'` elements now. Lists can contain **duplicate items**, whereas sets cannot. If we try to add a duplicate item to a set, it gets removed, but if we do the same in a list, it stays.

In [None]:
{1, 2, 3, 2}

{1, 2, 3}

In [None]:
[1, 2, 3, 2]

[1, 2, 3, 2]

Lastly, if we want to check the length of a set or list, we can use the built-in `len()` function.

In [None]:
#check the length of the states set
len(states)

5

In [None]:
#check the length of the cities list
len(cities)

7
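
Because sets drop duplicates, one use for them is counting the distinct items in a list. The sketch below assumes a hypothetical `numbers` list and uses the built-in `set()` function, which this training has not introduced, to convert the list into a set.

```python
numbers = [1, 2, 3, 2]          # hypothetical list with one duplicate
unique_numbers = set(numbers)   # converting to a set drops the repeated 2

print(len(numbers))             # 4
print(len(unique_numbers))      # 3
```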

Although lists and sets often contain items of the same data type, there is no requirement that the items have the same data type. For example, the list below contains the string `'Denver'`, Denver's population, and the number of neighborhoods.

In [None]:
#define a list with multiple data types
denver = ['Denver', 2931000, 78]

#view the list
denver

['Denver', 2931000, 78]

The table below summarizes the similarities and differences between lists and sets.

| | Lists | Sets |
| :--------- | :----- | :----- |
| Mutable | $\checkmark$ | $\checkmark$ |
| Ordered | $\checkmark$ |  |
| Allow for duplicate items | $\checkmark$  | |
| Indexed | $\checkmark$ | |
| Allow for different data types | $\checkmark$ | $\checkmark$ |

**Practice**: Using the `days` set defined in the code block below, remove the item that doesn't belong and add in an item that belongs.

In [None]:
days = {'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'January'}

In [None]:
days.remove('January')
days

{'Friday', 'Monday', 'Thursday', 'Tuesday', 'Wednesday'}

In [None]:
days.add('Saturday')
days

{'Friday', 'Monday', 'Saturday', 'Thursday', 'Tuesday', 'Wednesday'}

### **Conditional Statements**

Conditional statements in Python are used to execute code if a condition is met. To determine if a condition is met, we need to know if a statement is `True` or `False`.

The simplest conditional statement is the `if` statement, which has the following format.


```
if condition:
    statements to execute if condition is true
```

The indentation in conditional statements tells Python what code to run if the condition is met. Without the indentation, the `if` statement won't work as intended.


In [None]:
#if statement example
x = 5

if (x > 4):
    print("Greater than 4")

Greater than 4


The `if` statement executes code if a condition is met, but it doesn't do anything if the condition is not met. The `if-else` statement provides code to run in both scenarios and has the following format.

```
if (condition):
    statements to execute if condition is True
else:
    statements to execute if condition is False
```

In [None]:
x = 3
y = 5

if x >= y:
  print('x is greater than or equal to y')
else:
  print('x is smaller than y')

x is smaller than y


When you encounter a situation with more than two scenarios, you can use an `if-elif-else` statement. It's similar to the `if-else` statement, except you can place `elif` statements in between the `if` and `else` statements to account for several conditions. `if-elif-else` statements have the following format.

```
if (condition 1):
    statements to execute if condition 1 is True
elif (condition 2):
    statements to execute if condition 1 is False and condition 2 is True
else:
    statements to execute if both conditions are False
```

In [None]:
y = 0

if y > 0:
  print('Positive number')
elif y < 0:
  print('Negative number')
else:
  print('Zero')

Zero


### **Loops**

In programming, we often want to perform the same task multiple times. Instead of writing the same code over and over again, we can write the code once and place it inside of a loop, which will repeat the code as many times as we want. There are two types of loops in Python: `while` loops and `for` loops.

#### **While Loops**

As the name implies, `while` loops will repeat *while* a condition is true. They consist of two parts: a condition and code to execute while the condition is true. This structure sounds similar to conditional statements, but `while` loops have the capability to perform the same task multiple times, whereas `if` statements on their own do not. They are structured as follows.

```
while (condition):
    statements to execute if the condition is true
```

If the condition is true, the code within the `while` loop is executed. After the code is executed, the condition is checked again. If the condition is still true, the code is executed again. This is repeated until the condition is false.

In [None]:
x = 1

while x < 5:
  x = x + 1
  print(x)

2
3
4
5


`while` loops can be dangerous because we can accidentally create an infinite loop like the one below. This loop would never stop running because `x` is always greater than 1.

```
x = 2

while x > 1:
    x = x + 1
    print(x)
```

If you ever accidentally create an infinite loop, clicking Runtime → Interrupt session in the bar above will stop everything that's currently running.
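
One way to guard against an infinite loop while testing is to add a second condition that caps the number of repetitions. The sketch below is only an illustration: `iterations` and `max_iterations` are hypothetical names, and the limit of 5 is arbitrary.

```python
x = 2
iterations = 0
max_iterations = 5  # hypothetical safety limit, not part of the original loop

# 'and' requires both conditions to be true for the loop to continue
while x > 1 and iterations < max_iterations:
    x = x + 1
    iterations = iterations + 1
    print(x)
```

The loop prints 3 through 7 and then stops, because `iterations` reaches the limit even though `x > 1` is still true.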

#### **For Loops**

Instead of using a condition to determine whether code should be executed, `for` loops execute code a *fixed* number of times. They consist of two parts: an object to iterate over and code to execute for each iteration. `for` loops are structured as follows.

```
for item in iterable_object:
    statements to execute
```

In `for` loops, we can refer to the items in the iterable object as whatever variable name we want. For example, in the loop below, we refer to the items as `x`.


In [None]:
#define list containing numbers from 1 to 5
sequence = [1, 2, 3, 4, 5]

for x in sequence:
  print(x)

1
2
3
4
5


In the loop above, we iterated over a list of numbers from 1 to 5. But what if we wanted to iterate over a longer list of numbers? It would be tedious to manually type out the list. Instead, we can use the built-in `range()` function, which produces a sequence of numbers in a given range. For example, `range(1,10)` produces a sequence from 1 to 9, because the last item in the `range()` function is exclusive.


In [None]:
for x in range(1,10):
  print(x)

1
2
3
4
5
6
7
8
9


We don't have to increment the numbers by 1 though; for example, let's generate even numbers from 1 to 16.

In [None]:
#even numbers from 1 to 16
for x in range(2, 17, 2):
  print(x)


2
4
6
8
10
12
14
16
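
To see the numbers that `range()` produces without writing a loop, one option is to convert the range to a list with the built-in `list()` function. This is a small sketch; `list()` is not covered elsewhere in this training, and the variable names are only for illustration.

```python
# start at 2, stop before 17, step by 2
evens = list(range(2, 17, 2))
print(evens)       # [2, 4, 6, 8, 10, 12, 14, 16]

# a negative step counts down: start at 5, stop before 0
countdown = list(range(5, 0, -1))
print(countdown)   # [5, 4, 3, 2, 1]
```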


### **Functions**

We've already used a few functions, like `print()` or `type()`. In general, functions take inputs called arguments and usually produce something in return. The Python documentation provides a complete [list of built-in functions](https://docs.python.org/3/library/functions.html), and the table below includes some of the commonly used functions.

| Function | Description |
| :--------- | :----- |
| `print()` | prints the output |
| `sorted()` | sorts a list in ascending order |
| `type()` | returns the data type for an object |
| `abs()` | returns the absolute value of a number |
| `round()` | returns the number rounded to a specified number of decimal places |
| `sum()` | returns the sum of items in an object  |
| `len()` | returns the length of an object |
| `max()` | returns the largest item in an object |
| `min()` | returns the smallest item in an object |   

In [None]:
#define a list from 1 to 10
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [None]:
#add the elements in x together
sum(x)

55

In [None]:
#find the number of elements in x
len(x)

10

In [None]:
#find the smallest element in x
min(x)

1

We can also create our own custom functions. Functions can be extremely helpful if we've written code that will be used multiple times in the same script. Creating a function is a cleaner approach than just copying and pasting code.


Declaring a function consists of five parts:
1. The `def` keyword
2. The function name
  * The requirements for a function name are extremely similar to the variable name requirements. It's also recommended to avoid using the same name as an existing function, even though this is technically allowed.
3. Arguments
  * These are the values that are fed to the function.
4. The body of the function
5. A return statement
  * This is used to define the "output" of the function.
  * If you need to return more than one value from a  function, you can use a list.

The format for declaring a function is as follows:

```
def function_name(arguments):
  body of the function
  return statement
```




Let's start with something simple and define a function that adds two numbers. Python has the addition operator, `+`, so this is just to demonstrate the format for defining a function.

In [None]:
def addition(num1, num2): #def keyword, function name, and arguments
  total = num1 + num2     #body of the function ('total' avoids reusing the built-in name sum)
  return total            #return statement

In [None]:
addition(1, 2)

3

**Practice**: Create a function named `summary` that accepts a list and returns the minimum, maximum, and length of the list. Check that your function works as expected by trying it out on the list, `y`, defined in the code block below. *Hint: You'll need to return a list in order to return the minimum, maximum, and length.*

In [None]:
#list to check function
y = [2, 3, 4, 5, 6, 7, 8, 9]

In [None]:
def summary(values):  #'values' avoids reusing the built-in name list
  minimum = min(values)
  maximum = max(values)
  length = len(values)
  return [minimum, maximum, length]

In [None]:
summary(y)

[2, 9, 8]
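
Because the function returns a list, each value can be pulled out by its index. The sketch below repeats the `summary` function, with its parameter named `values`, so that it runs on its own; `result` is a hypothetical variable name.

```python
def summary(values):
    minimum = min(values)
    maximum = max(values)
    length = len(values)
    return [minimum, maximum, length]

y = [2, 3, 4, 5, 6, 7, 8, 9]
result = summary(y)   # store the returned list in a variable

print(result[0])    # minimum: 2
print(result[1])    # maximum: 9
print(result[-1])   # length: 8
```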

**Practice**: Create a `for` loop that prints the data type for each element in the `practice` list defined below.

In [None]:
practice = ['hello', 47, True, False, 100]

In [None]:
for x in practice:
  print(type(x))

### **Errors**

There are three types of errors that can occur when coding in Python: syntax errors, runtime errors, and logical errors. The term **debugging** is often used to describe the process of finding and correcting an error.

Syntax refers to the rules that define the structure of a programming language, so **syntax errors** occur when the proper syntax is not followed. Some examples of syntax errors are leaving out a comma or a bracket.





In [None]:
print('hello'

SyntaxError: incomplete input (<ipython-input-1-0b37b907169d>, line 1)

**Runtime errors** occur when the syntax is correct, but the program can't run for a different reason, like dividing by zero or trying to access an object that doesn't exist.

In [None]:
#the name of the set is 'states' not state
state

NameError: name 'state' is not defined

**Logical errors** are the most difficult to fix because there are no error messages. The code runs without any issues, but the result is incorrect due to flawed logic.

In [None]:
#define a list from 1 to 10
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

#find the sum of the first 5 elements (it should be 15)
sum(x[0:4])

10

The code above returned a value of 10 instead of the expected 15. This is because the last value in the range is exclusive.

Logical errors can often be fixed by double checking the results of intermediate steps in the code.

In [None]:
x[0:4] #only the first 4 elements, not 5

[1, 2, 3, 4]
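
A corrected version of the code above might look like the following sketch, which checks the intermediate slice before summing it. The variable name `first_five` is only for illustration.

```python
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# the ending index must be 5 to include the element at index 4
first_five = x[0:5]

print(first_five)        # [1, 2, 3, 4, 5]
print(len(first_five))   # 5
print(sum(first_five))   # 15
```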

### **Libraries**

Python has a lot of capabilities on its own, but we can do even more with the use of libraries. Libraries are collections of code and functions that extend the capabilities of Python. Some of the most popular libraries are pandas, NumPy, Plotly, and Matplotlib.

### **The pandas Library**

We'll start with the pandas library, which contains a variety of tools for doing things like:
* manipulating data
* reading and writing data
* slicing, indexing, and subsetting data
* aggregating or transforming data
* merging and joining data sets
* time series analysis

The pandas library also has a [user guide](https://pandas.pydata.org/docs/user_guide/index.html) that covers a wide variety of topics and how pandas can be used to approach data analysis problems.

To use a library in Python, we use an `import` statement. It is recommended to import all libraries at the beginning of the script or notebook. Although this isn't necessary, it's common to import libraries under an alias or an alternate shorter name. For example, the pandas library is often imported with an alias of `pd`. This means the library can now be referred to as `pd` instead of `pandas`.


In [2]:
import pandas as pd

The table below contains some functions from the pandas library; however, this is just scratching the surface of the capabilities of pandas. We'll use these functions in the following sections to perform some exploratory data analysis. The pandas [user guide](https://pandas.pydata.org/docs/user_guide/index.html) contains detailed documentation for all of the functions in the library.


| Function Name | Use |
| :--------- | :----- |
| `pd.read_csv()` | imports a csv file as a `DataFrame` |
| `DataFrame.info()` | provides a concise summary of a `DataFrame` |
| `DataFrame.dtypes` | provides the data types for each column in a `DataFrame` |
| `DataFrame.shape` | provides the number of rows and columns in a `DataFrame` |
| `DataFrame.describe()` | provides descriptive statistics for each column in a `DataFrame` |
| `DataFrame.sample()` | randomly selects a sample from a `Series` or `DataFrame` |
| `DataFrame.isin()` | returns a `DataFrame` of booleans showing whether each element was contained in the provided set of values |
| `DataFrame.isna()` | returns a `DataFrame` filled with boolean values indicating missing values |
| `DataFrame.dropna()` | removes missing values |
| `DataFrame.sort_values()` | sorts the values in a `DataFrame` in ascending or descending order based on one or more columns |
| `DataFrame.nunique()` | returns a `Series` with the number of distinct elements in each column |
| `Series.unique()` | returns the unique values in a single column (a `Series`) in order of appearance |
| `DataFrame.value_counts()` | returns a `Series` containing the counts of unique values |
| `DataFrame.sum()` | returns the column (or row) sums |
| `DataFrame.groupby()` | groups a `DataFrame` by values in one or more columns |
| `pd.to_datetime()` | converts the data type to datetime |
| `pd.to_numeric()` | converts the data type to numeric |
| `DataFrame.astype()` | converts the data to a specified type, such as `str` |
| `DataFrame.duplicated()` | returns a boolean `Series` denoting duplicate rows |
| `DataFrame.drop_duplicates()` | returns a `DataFrame` with duplicate rows removed |
| `DataFrame.merge()` | merges two `DataFrame` objects |
| `DataFrame.to_excel()` | saves a `DataFrame` as an Excel sheet |


#### **Importing Data**

Let's import a dataset and perform some exploratory data analysis using the pandas library. Often we'll have datasets stored as csv files and pandas makes it easy to import a csv file with the `pd.read_csv()` function. The `pd.read_csv()` function takes a file path as the argument. This can be the path to a file stored on your computer or even a URL.

We'll use a purchase card data set to test out some of the capabilities of pandas. Each row in the data set represents a purchase card transaction.

In [3]:
#define a variable, 'path', that stores the URL for the data
#this step is optional and depends on style preference
path = 'https://github.com/kayley-smiley/Python-Training/blob/main/P-Card%20Data.csv?raw=true'

#read in the data and name it 'df'
df = pd.read_csv(path)
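
To try `pd.read_csv()` without a file or an internet connection, one option is to read csv text stored in a string. The sketch below is an illustration only: `csv_text` and `small_df` are hypothetical names, the three rows are made-up sample values, and `io.StringIO` from the standard library is used to make the string behave like a file.

```python
import io
import pandas as pd

# hypothetical csv text standing in for a file on disk or a URL
csv_text = """Vendor,Billing_Amount
AMAZON,110.26
SQUARE,700.0
PAYPAL,139.2
"""

# read the text as if it were a csv file
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)   # (3, 2)
print(small_df)
```

The first line of the csv text becomes the column names, and each remaining line becomes a row.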

When we first import a dataset, it's a good idea to check the size of the dataset and take a look at a couple of rows. To check the size, we can use the `DataFrame.shape` attribute, which returns the number of rows and columns. Because `shape` is an attribute rather than a function, it is written without parentheses. To use it, replace the word `DataFrame` with the name of a `DataFrame` object. In our case, this would be `df`. This general rule applies to any pandas function or attribute with `DataFrame` in the name.



In [36]:
df.shape

(22846, 7)

To view a couple rows from the dataset, we can use the `DataFrame.head()` and `DataFrame.tail()` functions. The `DataFrame.head()` function will return rows from the beginning of the dataset and the `DataFrame.tail()` function will return rows from the end of the dataset. Both of these functions have an optional argument, `n`, to specify the number of rows that will be returned. If you don't specify `n`, the function will default to 5 rows.

In [None]:
#look at the first 5 rows
df.head()

| | New ID | Vendor | Charge_Date | Cardholder_Agency | PC_Number | Billing_Amount |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 184782 | AMAZON | 6/25/2023 | Library | PC-225445 | 110.26 |
| 1 | 276017 | SQUARE | 8/28/2023 | Library | PC-157370 | 700.0 |
| 2 | 185196 | SQUARE | 8/19/2023 | City Council | PC-194507 | 30.0 |
| 3 | 208673 | SQUARE | 8/30/2023 | City Council | PC-107176 | 1009.55 |
| 4 | 175842 | SQUARE | 8/26/2023 | City Council | PC-215581 | 564.5 |


In [None]:
#look at the last 10 rows
df.tail(n=10)

| | New ID | Vendor | Charge_Date | Cardholder_Agency | PC_Number | Billing_Amount |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 22836 | 151581 | AMAZON | 11/24/2023 | Library | PC-287196 | 34.63 |
| 22837 | 151581 | AMAZON | 11/20/2023 | Library | PC-101340 | 86.76 |
| 22838 | 191142 | AMAZON | 10/4/2023 | Library | PC-151135 | -29.35 |
| 22839 | 170469 | AMAZON | 11/27/2023 | Library | PC-252340 | 438.32 |
| 22840 | 191142 | AMAZON | 11/3/2023 | Library | PC-275227 | 68.22 |
| 22841 | 191142 | AMAZON | 11/4/2023 | Library | PC-275227 | 36.53 |
| 22842 | 151581 | AMAZON | 11/20/2023 | Library | PC-184893 | 31.99 |
| 22843 | 151581 | AMAZON | 11/18/2023 | Library | PC-110199 | 18.22 |
| 22844 | 191142 | AMAZON | 11/5/2023 | Library | PC-275227 | 22.26 |
| 22845 | 226031 | PAYPAL | 9/26/2023 | Library | PC-282973 | 139.2 |


#### **Data Structures**

pandas has two widely used data structures: `Series` and `DataFrame`. You can think of a `Series` as a single column of data and a `DataFrame` as a dataset that contains many rows and columns. This means that our purchase card dataset (`df`) is a `DataFrame` and each of its columns is a `Series`.

If we only want to look at specific rows or columns of a `DataFrame`, there are several ways that we can create a subset of the data. The general way to subset a `DataFrame` is by specifying the subset inside square brackets, `[]`, after the name of the `DataFrame`.

For example, if we want to just look at the `Vendor` column from `df`, we can use the following code.



In [None]:
df['Vendor']

0        AMAZON
1        SQUARE
2        SQUARE
3        SQUARE
4        SQUARE
          ...  
22841    AMAZON
22842    AMAZON
22843    AMAZON
22844    AMAZON
22845    PAYPAL
Name: Vendor, Length: 22846, dtype: object

If we want to view multiple columns of a `DataFrame`, we can provide a list of column names inside the square brackets.

In [None]:
df[['Vendor', 'Charge_Date']]

| | Vendor | Charge_Date |
| :--- | :--- | :--- |
| 0 | AMAZON | 6/25/2023 |
| 1 | SQUARE | 8/28/2023 |
| 2 | SQUARE | 8/19/2023 |
| 3 | SQUARE | 8/30/2023 |
| 4 | SQUARE | 8/26/2023 |
| ... | ... | ... |
| 22841 | AMAZON | 11/4/2023 |
| 22842 | AMAZON | 11/20/2023 |
| 22843 | AMAZON | 11/18/2023 |
| 22844 | AMAZON | 11/5/2023 |


Similar to what we did with lists earlier, we can also subset a `DataFrame` based on row or column numbers. Remember that Python starts indexing at 0, instead of 1! When specifying row or column numbers, the syntax for subsetting slightly changes to the following:

```
DataFrame.iloc[row number(s), column number(s)]
```


In [None]:
#view the entry in the first row and column
df.iloc[0,0]

'184782'

We can also specify a range of row numbers and column numbers. Recall that the first value in a range is inclusive and the last value is exclusive!

In [None]:
#view the first two rows and first two columns
df.iloc[0:2, 0:2]

| | New ID | Vendor |
| :--- | :--- | :--- |
| 0 | 184782 | AMAZON |
| 1 | 276017 | SQUARE |


If we only include row numbers in the range, all columns for those rows will be included. Similarly, if we only include column numbers in the range, all rows for those columns will be included.

In [None]:
#view the third row (index 2) in the dataframe
df.iloc[2,:]

New ID                     185196
Vendor                     SQUARE
Charge_Date             8/19/2023
Cardholder_Agency    City Council
PC_Number               PC-194507
Billing_Amount               30.0
Name: 2, dtype: object

In [None]:
#view the fourth column (index 3) in the dataframe
df.iloc[:,3]

0             Library
1             Library
2        City Council
3        City Council
4        City Council
             ...     
22841         Library
22842         Library
22843         Library
22844         Library
22845         Library
Name: Cardholder_Agency, Length: 22846, dtype: object
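
To recap the position-based patterns in one place, here is a minimal sketch on a three-row table. The `toy` table and its values are assumptions made for illustration; the same `iloc` patterns apply to `df`.

```python
import pandas as pd

# Hypothetical three-row table (values are made up for illustration)
toy = pd.DataFrame({
    'New ID': ['184782', '276017', '185196'],
    'Vendor': ['AMAZON', 'SQUARE', 'SQUARE'],
    'Billing_Amount': [110.26, 700.00, 30.00],
})

first_entry = toy.iloc[0, 0]     # row number 0, column number 0
top_left = toy.iloc[0:2, 0:2]    # rows 0-1 and columns 0-1 (the stop value is excluded)
third_row = toy.iloc[2, :]       # all columns for row number 2
second_col = toy.iloc[:, 1]      # all rows for column number 1
last_row = toy.iloc[-1, :]       # negative numbers count from the end, as with lists

print(first_entry)
print(top_left)
```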

We can also subset a `DataFrame` with a boolean expression, where we only include entries that meet a certain condition.

For example, one of the entries in `Cardholder_Agency` is `'Library'`. We can look at a subset of `df` that only contains rows where the `Cardholder_Agency` is `'Library'` by placing a boolean condition inside the brackets.

In [None]:
#look at rows in df where the Cardholder_Agency is Library
df[df['Cardholder_Agency'] == 'Library']

Unnamed: 0,New ID,Vendor,Charge_Date,Cardholder_Agency,PC_Number,Billing_Amount
0,184782,AMAZON,6/25/2023,Library,PC-225445,110.26
1,276017,SQUARE,8/28/2023,Library,PC-157370,700.00
6,129828,AMAZON,8/15/2023,Library,PC-201297,35.48
12,280035,AMAZON,9/12/2023,Library,PC-221503,285.88
13,280035,AMAZON,9/8/2023,Library,PC-221503,392.04
...,...,...,...,...,...,...
22841,191142,AMAZON,11/4/2023,Library,PC-275227,36.53
22842,151581,AMAZON,11/20/2023,Library,PC-184893,31.99
22843,151581,AMAZON,11/18/2023,Library,PC-110199,18.22
22844,191142,AMAZON,11/5/2023,Library,PC-275227,22.26


We can also use a compound boolean expression to subset a `DataFrame`. Let's look at the subset that contains all rows where the `Cardholder_Agency` is `'Library'` and the `Vendor` is `'AMAZON'`.

In [None]:
#look at rows in df where the Cardholder_Agency is Library and the Vendor is AMAZON
df[(df['Cardholder_Agency'] == 'Library') & (df['Vendor'] == 'AMAZON')]

Unnamed: 0,New ID,Vendor,Charge_Date,Cardholder_Agency,PC_Number,Billing_Amount,Charge_Day
8,201485,AMAZON,2023-10-16,City Council,PC-283774,87.99,Monday
9,197267,AMAZON,2023-10-30,Airport,PC-298024,26.10,Monday
10,103860,AMAZON,2023-09-28,City Council,PC-174559,16.19,Thursday
11,237456,AMAZON,2023-08-27,City Council,PC-211871,58.78,Sunday
15,216875,AMAZON,2023-08-10,Public Health,PC-261900,279.21,Thursday
...,...,...,...,...,...,...,...
22427,211100,AMAZON,2023-01-13,Transportation and Infrastructure,PC-256106,25.72,Friday
22428,211100,AMAZON,2023-01-11,Transportation and Infrastructure,PC-128938,25.72,Wednesday
22429,211100,AMAZON,2023-01-11,Transportation and Infrastructure,PC-144104,25.72,Wednesday
22432,211100,AMAZON,2023-01-13,Transportation and Infrastructure,PC-144912,25.72,Friday
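
The pieces of a compound condition can also be stored in variables first, which may make the logic easier to read. The following is a sketch on a made-up table (`toy` and the variable names are assumptions, not part of the training data). It uses `&` for "and", `|` for "or", and `~` for "not".

```python
import pandas as pd

# Hypothetical table (values are made up for illustration)
toy = pd.DataFrame({
    'Vendor': ['AMAZON', 'SQUARE', 'AMAZON', 'PAYPAL'],
    'Cardholder_Agency': ['Library', 'Library', 'City Council', 'Airport'],
})

# Each comparison produces a boolean Series with one value per row
is_library = toy['Cardholder_Agency'] == 'Library'
is_amazon = toy['Vendor'] == 'AMAZON'

both = toy[is_library & is_amazon]      # rows where both conditions are True
either = toy[is_library | is_amazon]    # rows where at least one condition is True
not_library = toy[~is_library]          # rows where the condition is False

print(both)
```

When the comparisons are written directly inside the brackets, each one needs its own parentheses because `&` and `|` are evaluated before `==`.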


If we're interested in only looking at transactions from a handful of agencies, we can use the `Series.isin()` function. This function checks whether each element of a column is contained in a group of values.


In [33]:
#set of agencies we're interested in
agencies = {'City Council', 'Public Health', 'Transportation and Infrastructure'}

df[df['Cardholder_Agency'].isin(agencies)]

Unnamed: 0,New ID,Vendor,Charge_Date,Cardholder_Agency,PC_Number,Billing_Amount,Charge_Day
2,185196,SQUARE,2023-08-19,City Council,PC-194507,30.00,Saturday
3,208673,SQUARE,2023-08-30,City Council,PC-107176,1009.55,Wednesday
4,175842,SQUARE,2023-08-26,City Council,PC-215581,564.50,Saturday
7,249598,SQUARE,2023-09-07,City Council,PC-217341,1290.50,Thursday
8,201485,AMAZON,2023-10-16,City Council,PC-283774,87.99,Monday
...,...,...,...,...,...,...,...
22427,211100,AMAZON,2023-01-13,Transportation and Infrastructure,PC-256106,25.72,Friday
22428,211100,AMAZON,2023-01-11,Transportation and Infrastructure,PC-128938,25.72,Wednesday
22429,211100,AMAZON,2023-01-11,Transportation and Infrastructure,PC-144104,25.72,Wednesday
22432,211100,AMAZON,2023-01-13,Transportation and Infrastructure,PC-144912,25.72,Friday


Technically, we could have accomplished the same thing with the following code. However, using the `Series.isin()` function is a cleaner approach, especially when the set of agencies is large.

In [34]:
df[(df['Cardholder_Agency'] == 'City Council') | (df['Cardholder_Agency'] == 'Public Health') |
   (df['Cardholder_Agency'] == 'Transportation and Infrastructure')]

Unnamed: 0,New ID,Vendor,Charge_Date,Cardholder_Agency,PC_Number,Billing_Amount,Charge_Day
2,185196,SQUARE,2023-08-19,City Council,PC-194507,30.00,Saturday
3,208673,SQUARE,2023-08-30,City Council,PC-107176,1009.55,Wednesday
4,175842,SQUARE,2023-08-26,City Council,PC-215581,564.50,Saturday
7,249598,SQUARE,2023-09-07,City Council,PC-217341,1290.50,Thursday
8,201485,AMAZON,2023-10-16,City Council,PC-283774,87.99,Monday
...,...,...,...,...,...,...,...
22427,211100,AMAZON,2023-01-13,Transportation and Infrastructure,PC-256106,25.72,Friday
22428,211100,AMAZON,2023-01-11,Transportation and Infrastructure,PC-128938,25.72,Wednesday
22429,211100,AMAZON,2023-01-11,Transportation and Infrastructure,PC-144104,25.72,Wednesday
22432,211100,AMAZON,2023-01-13,Transportation and Infrastructure,PC-144912,25.72,Friday
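
As a quick check that the two approaches really do select the same rows, the sketch below compares them on a small made-up table (`toy` and its values are assumptions). It also shows that `~` can be placed in front of `isin()` to keep everything outside the group.

```python
import pandas as pd

# Hypothetical table (values are made up for illustration)
toy = pd.DataFrame({
    'Cardholder_Agency': ['Library', 'City Council', 'Public Health', 'Airport'],
    'Billing_Amount': [110.26, 30.00, 279.21, 26.10],
})
agencies = ['City Council', 'Public Health']

# Approach 1: isin() with a group of values
with_isin = toy[toy['Cardholder_Agency'].isin(agencies)]

# Approach 2: one comparison per agency, joined with |
with_or = toy[(toy['Cardholder_Agency'] == 'City Council') |
              (toy['Cardholder_Agency'] == 'Public Health')]

# equals() returns True when two DataFrames have the same contents
print(with_isin.equals(with_or))

# ~ flips the condition: keep the agencies that are NOT in the group
excluded = toy[~toy['Cardholder_Agency'].isin(agencies)]
print(excluded)
```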


#### **Data Types**

For the most part, there is not a big difference between the data types we talked about earlier and data types in pandas. However, we will discuss two additional data types that are specific to pandas: `objects` and `datetime objects`.

When you import a data set, pandas attempts to guess the data type for each column. Columns with text are marked as `objects` by default. The `object` data type also includes any columns that pandas couldn't easily assign a more specific data type.

Let's check the data types for the variables in `df` using the `DataFrame.dtypes` attribute. Because it is an attribute rather than a function, it is written without parentheses.

In [None]:
df.dtypes

New ID                object
Vendor                object
Charge_Date           object
Cardholder_Agency     object
PC_Number             object
Billing_Amount       float64
dtype: object

Notice that most of the variables are `objects`. Since these data types are just guesses by pandas, some may not be the most appropriate choices. For example, `float` is appropriate for `Billing_Amount`, but `object` is not the best fit for `Charge_Date`. The pandas `datetime` data type is more appropriate for `Charge_Date` because pandas contains certain functions that can only be used on `datetime` objects. For example, if we wanted to create a variable that contains the day of the week for each transaction, we can use the `Series.dt.day_name()` function, but notice we get an error if we try to use that on an `object`.











In [None]:
df['Charge_Date'].dt.day_name()

AttributeError: Can only use .dt accessor with datetimelike values

As the error says, we can only use that function with a `datetime` object, so we first need to change the data type for `Charge_Date`.

There are a couple of different ways to change data types in pandas, so we'll just cover the following functions:
  * `pd.to_datetime()` changes the data type to `datetime`
  * `pd.to_numeric()` changes the data type to `numeric`
    * pandas will further label it as `float` or `int` based on the data provided
  * `Series.astype(str)` changes the data type to `string`
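
Here is a small sketch of the numeric and string conversions on a made-up table (`toy` and its values are assumptions). It uses `Series.astype(str)` for the string conversion, and the optional `errors='coerce'` argument of `pd.to_numeric()` to turn text that can't be converted into a missing value instead of raising an error.

```python
import pandas as pd

# Hypothetical table where the amounts were read in as text (values are made up)
toy = pd.DataFrame({
    'New ID': [184782, 276017, 185196],
    'Billing_Amount': ['110.26', '700.00', '30.00'],
})

# Convert the text amounts to numbers and save over the old column
toy['Billing_Amount'] = pd.to_numeric(toy['Billing_Amount'])

# Convert the numeric IDs to strings, since we never do math with an ID
toy['New ID'] = toy['New ID'].astype(str)

print(toy.dtypes)

# errors='coerce' replaces values that can't be converted with a missing value
coerced = pd.to_numeric(pd.Series(['12.50', 'n/a']), errors='coerce')
print(coerced)
```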


In [None]:
pd.to_datetime(df['Charge_Date'])

0       2023-06-25
1       2023-08-28
2       2023-08-19
3       2023-08-30
4       2023-08-26
           ...    
22841   2023-11-04
22842   2023-11-20
22843   2023-11-18
22844   2023-11-05
22845   2023-09-26
Name: Charge_Date, Length: 22846, dtype: datetime64[ns]

However, this code alone doesn't change the data type for the variable in `df`. To do this, we essentially need to save over the old `Charge_Date` variable with the new version.

In [5]:
df['Charge_Date'] = pd.to_datetime(df['Charge_Date'])

Now, let's try the `Series.dt.day_name()` function again.

In [None]:
df['Charge_Date'].dt.day_name()

0           Sunday
1           Monday
2         Saturday
3        Wednesday
4         Saturday
           ...    
22841     Saturday
22842       Monday
22843     Saturday
22844       Sunday
22845      Tuesday
Name: Charge_Date, Length: 22846, dtype: object

This looks like what we want! Let's add this new variable to `df` and name it `Charge_Day`.

In [6]:
df['Charge_Day'] = df['Charge_Date'].dt.day_name()

df

Unnamed: 0,New ID,Vendor,Charge_Date,Cardholder_Agency,PC_Number,Billing_Amount,Charge_Day
0,184782,AMAZON,2023-06-25,Library,PC-225445,110.26,Sunday
1,276017,SQUARE,2023-08-28,Library,PC-157370,700.00,Monday
2,185196,SQUARE,2023-08-19,City Council,PC-194507,30.00,Saturday
3,208673,SQUARE,2023-08-30,City Council,PC-107176,1009.55,Wednesday
4,175842,SQUARE,2023-08-26,City Council,PC-215581,564.50,Saturday
...,...,...,...,...,...,...,...
22841,191142,AMAZON,2023-11-04,Library,PC-275227,36.53,Saturday
22842,151581,AMAZON,2023-11-20,Library,PC-184893,31.99,Monday
22843,151581,AMAZON,2023-11-18,Library,PC-110199,18.22,Saturday
22844,191142,AMAZON,2023-11-05,Library,PC-275227,22.26,Sunday


#### **Checking for Data Completeness**

When we first get a dataset, it's a good idea to check for **completeness**. Data completeness is the extent to which a dataset contains all the necessary elements and observations for a given purpose or analysis.

First, we should check that the dataset includes what we expect. For example, our purchase card data (`df`) should only contain transactions from 2023. Let's double-check that the date range is what we expect. To do this, we can look at the minimum and maximum values for the `Charge_Date` column.









In [None]:
#find the minimum charge date
min(df['Charge_Date'])

Timestamp('2023-01-01 00:00:00')

In [None]:
#find the maximum charge date
max(df['Charge_Date'])

Timestamp('2023-12-26 00:00:00')

Just as we expect, the purchase card data only spans 2023.
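
Another way to run the same check is sketched below on a short made-up `Series` of dates (the `dates` variable and its values are assumptions). It uses the `Series.min()` and `Series.max()` functions, and then confirms in one step that every date falls in 2023 by comparing `Series.dt.year` to the expected year.

```python
import pandas as pd

# Hypothetical dates standing in for df['Charge_Date'] (values are made up)
dates = pd.to_datetime(pd.Series(['2023-01-01', '2023-06-25', '2023-12-26']))

earliest = dates.min()
latest = dates.max()
print(earliest, latest)

# dt.year pulls out the year of each date; all() is True only if every comparison is True
all_in_2023 = (dates.dt.year == 2023).all()
print(all_in_2023)
```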

Next, we should identify any **missing values** and decide how to handle them. To identify missing values, we can use the `DataFrame.isna()` function. This function will return a boolean `DataFrame` of the same size indicating if each entry is missing (`True`) or not (`False`). It is worth mentioning here that this function only detects true missing values, not empty strings, `''`, or strings like `'NA'`.




In [None]:
df.isna()

Unnamed: 0,New ID,Vendor,Charge_Date,Cardholder_Agency,PC_Number,Billing_Amount,Charge_Day
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
22841,False,False,False,False,False,False,False
22842,False,False,False,False,False,False,False
22843,False,False,False,False,False,False,False
22844,False,False,False,False,False,False,False


This output isn't super helpful on its own for a large `DataFrame`. It would be more useful to have a count of missing values. Remember earlier when we mentioned that adding booleans would come in handy? Since `True` is equivalent to `1`, if we find the sum of boolean values, this is equivalent to counting the missing values.

The `DataFrame.sum()` function will return the column sums for a `DataFrame` by default. If the optional `axis` argument is specified as `1`, then the row sums will be returned.

In [7]:
#sum of missing values
(df.isna()).sum()

New ID               0
Vendor               0
Charge_Date          0
Cardholder_Agency    0
PC_Number            0
Billing_Amount       0
Charge_Day           0
dtype: int64
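
Since `df` has no missing values, the counts are easier to see on a table that does. The sketch below uses a made-up table (`toy` and its values are assumptions) to show the column counts, the row counts with `axis = 1`, and the earlier point that an empty string is not counted as missing. Replacing empty strings with `np.nan` before counting is one possible way to handle that, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one true missing value per column and one empty string
toy = pd.DataFrame({
    'Vendor': ['AMAZON', None, ''],
    'Billing_Amount': [110.26, np.nan, 30.00],
})

# Count of missing values in each column (the empty string is not counted)
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count of missing values in each row
missing_per_row = toy.isna().sum(axis=1)
print(missing_per_row)

# Turn empty strings into true missing values so that isna() can detect them
cleaned_vendor = toy['Vendor'].replace('', np.nan)
print(cleaned_vendor.isna().sum())
```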

Now that we've quantified how many values are missing, we need to decide how to handle them. This largely depends on the quantity of missing values and the context of the analysis. To inform the decision-making process, it can be helpful to see which rows contain missing values. To do this, we can combine the `DataFrame.isna()` function with the `DataFrame.any()` function. When applied to a `DataFrame`, the `DataFrame.any()` function returns a `Series` indicating if any row or column contains a `True` value. By default, the function checks columns, so we need to specify the optional `axis` argument as `1`.





In [9]:
(df.isna()).any(axis = 1)

0        False
1        False
2        False
3        False
4        False
         ...  
22841    False
22842    False
22843    False
22844    False
22845    False
Length: 22846, dtype: bool

Now, we have a `Series` with `True` if the row contains at least one missing value and `False` if the row doesn't contain any missing values. If we want to view the actual rows that contain missing values, we can use this boolean `Series` to subset `df`.

In [10]:
df[(df.isna()).any(axis = 1)]

Unnamed: 0,New ID,Vendor,Charge_Date,Cardholder_Agency,PC_Number,Billing_Amount,Charge_Day
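
The result above is empty because `df` has no missing values. To see what the same pattern returns when values are missing, here is a sketch on a made-up table (`toy` and its values are assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values in two different rows
toy = pd.DataFrame({
    'Vendor': ['AMAZON', None, 'SQUARE'],
    'Billing_Amount': [110.26, 700.00, np.nan],
})

# True for each row that contains at least one missing value
has_missing = toy.isna().any(axis=1)

rows_with_missing = toy[has_missing]    # rows to investigate
complete_rows = toy[~has_missing]       # rows with nothing missing

print(rows_with_missing)
```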


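It's important to deal with missing values in an appropriate way because they can significantly affect results. For example, any arithmetic operation involving a missing value will result in a missing value.

pandas provides the special value `pd.NA` to represent a missing value. Depending on a column's data type, you may also see missing values displayed as `NaN` or `NaT`.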



In [12]:
pd.NA

<NA>

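The effect of a missing value on arithmetic can be seen with a couple of one-line experiments. This is a sketch with made-up numbers (the `amounts` variable is an assumption). Note that some summary functions, such as `Series.sum()`, skip missing values by default, which can hide the problem unless `skipna = False` is specified.

```python
import pandas as pd

# Arithmetic with pd.NA gives pd.NA back
result = pd.NA + 1
print(result)

# Hypothetical amounts with one missing value (None is stored as NaN in a float column)
amounts = pd.Series([10.0, None, 5.0])

# Element-by-element arithmetic keeps the missing value missing
doubled = amounts * 2
print(doubled)

# sum() skips missing values by default; skipna=False makes the result missing instead
total = amounts.sum()
strict_total = amounts.sum(skipna=False)
print(total, strict_total)
```
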
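If the missing value was due to some kind of data entry error and you know what should be in its place, you can fill it in with the `DataFrame.fillna()` function. If you don't know what the value should be, one option is to remove the rows that contain missing values with the `DataFrame.dropna()` function.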
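
A minimal sketch of `dropna()` and `fillna()` on a made-up table follows. The `toy` table and the fill values (`'UNKNOWN'` and `0`) are assumptions chosen only for illustration; the right replacement always depends on the data. Both functions return a new `DataFrame` and leave the original unchanged unless the result is saved.

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values (values are made up)
toy = pd.DataFrame({
    'Vendor': ['AMAZON', None, 'SQUARE'],
    'Billing_Amount': [110.26, 700.00, np.nan],
})

# dropna() removes every row that contains at least one missing value
dropped = toy.dropna()
print(dropped)

# fillna() replaces missing values; a dictionary gives a different value per column
filled = toy.fillna({'Vendor': 'UNKNOWN', 'Billing_Amount': 0})
print(filled)

# The original table still has its missing values
print(toy.isna().sum().sum())
```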

Lastly, we should check for **duplicate rows** in the dataset. To identify any duplicates, we can use the `DataFrame.duplicated()` function. This function returns a boolean `Series` indicating which rows are duplicates. The function has two optional arguments: `subset` and `keep`.
- `subset`: can be used to specify a subset of columns that should be used to identify duplicates (if we don't want to base duplicates off of every column)
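- `keep`: determines which duplicates (if any) are marked as `True`. With the default, `'first'`, every duplicate except the first occurrence is marked; `'last'` marks every duplicate except the last occurrence; and `False` marks all duplicates.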
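
The effect of the two optional arguments can be sketched on a made-up table in which rows 0 and 2 are identical (`toy` and its values are assumptions).

```python
import pandas as pd

# Hypothetical table: rows 0 and 2 match on every column,
# and rows 0, 1, and 2 match on 'New ID' and 'Billing_Amount'
toy = pd.DataFrame({
    'New ID': ['211100', '211100', '211100', '184782'],
    'Billing_Amount': [25.72, 25.72, 25.72, 110.26],
    'PC_Number': ['PC-256106', 'PC-128938', 'PC-256106', 'PC-225445'],
})

# Default: compare every column and mark all but the first occurrence
dup_default = toy.duplicated()

# keep='last' marks all but the last occurrence; keep=False marks every duplicate
dup_last = toy.duplicated(keep='last')
dup_all = toy.duplicated(keep=False)

# subset limits the comparison to the listed columns
dup_subset = toy.duplicated(subset=['New ID', 'Billing_Amount'])

# View the duplicated rows by using the boolean Series to subset the table
print(toy[dup_all])
```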


#### **Exploratory Data Analysis**
- checking unique values
- checking for outliers
- summary statistics
- some group by summary statistics
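
As a rough sketch of the topics listed above, the following uses a made-up table (`toy` and its values are assumptions). The particular functions shown (`unique()`, `value_counts()`, `describe()`, `nlargest()`, and `groupby()`) are one possible choice for each task, not the only one.

```python
import pandas as pd

# Hypothetical table (values are made up for illustration)
toy = pd.DataFrame({
    'Vendor': ['AMAZON', 'SQUARE', 'AMAZON', 'PAYPAL'],
    'Cardholder_Agency': ['Library', 'Library', 'City Council', 'Library'],
    'Billing_Amount': [110.26, 700.00, 30.00, 45.10],
})

# Unique values in a column, and how often each one appears
unique_vendors = toy['Vendor'].unique()
vendor_counts = toy['Vendor'].value_counts()

# Summary statistics (count, mean, min, max, quartiles) for a numeric column
summary = toy['Billing_Amount'].describe()

# The largest amounts are a simple starting point when looking for outliers
largest = toy['Billing_Amount'].nlargest(2)

# Group-by summary: total billing amount for each agency
agency_totals = toy.groupby('Cardholder_Agency')['Billing_Amount'].sum()

print(summary)
print(agency_totals)
```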

#### **Sampling**
- how to pull a random sample
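
One way to pull a random sample is the `DataFrame.sample()` function, sketched here on a made-up table of 100 rows (`toy` is an assumption). The `random_state` value of `42` is arbitrary; setting it to any fixed number makes the same rows come back each time the code is run.

```python
import pandas as pd

# Hypothetical table with 100 rows (values are made up)
toy = pd.DataFrame({
    'New ID': range(100),
    'Billing_Amount': [float(i) for i in range(100)],
})

# Sample a fixed number of rows (without replacement by default)
sample_n = toy.sample(n=10, random_state=42)

# Sample a fraction of the rows instead (10% of 100 rows is 10 rows)
sample_frac = toy.sample(frac=0.1, random_state=42)

print(sample_n)
```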

#### **Exporting Results**
- saving a specific result
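
One common way to save a result is to write it to a CSV file with the `DataFrame.to_csv()` function. The sketch below is an illustration only: the `toy` table, the file name `library_charges.csv`, and the temporary folder are all assumptions. Setting `index = False` leaves the row numbers out of the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical table (values are made up for illustration)
toy = pd.DataFrame({
    'Vendor': ['AMAZON', 'SQUARE', 'AMAZON', 'PAYPAL'],
    'Cardholder_Agency': ['Library', 'Library', 'City Council', 'Library'],
    'Billing_Amount': [110.26, 700.00, 30.00, 45.10],
})

# The specific result we want to save: only the Library transactions
library_rows = toy[toy['Cardholder_Agency'] == 'Library']

# Write the result to a CSV file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'library_charges.csv')
library_rows.to_csv(path, index=False)

# Read the file back in to confirm that it was saved as expected
reloaded = pd.read_csv(path)
print(reloaded)
```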

### **Practice**

*Assign some EDA tasks using functions described above and concepts from the first half.*

### **Helpful Tips for Running Code**

*at the end, have them run multiple blocks at once using these shortcuts*

If we have a large number of code blocks, it can be tedious to go through and run every block individually. Instead of doing that, we have a couple of options for running multiple blocks of code.

* Runtime → Run all will run all cells in the notebook from top to bottom.

* Runtime → Run before will run all cells before the current cell.

* Runtime → Run after will run all cells after the current cell.

There are also some situations where we might want to stop the code we're running or restart our current session.

* Runtime → Restart session will give you a fresh start as if you haven't run any code. Restarting the session can be useful if you're having difficulty debugging your code. However, all of your variables will be gone!

* Runtime → Interrupt execution will stop whatever code is running. This can be helpful if you accidentally code an infinite loop or a code block is taking longer to run than expected.



### **Sources**

* Gaddis, Tony. (2015). Starting out with Python, 3rd Edition.
* [W3 Schools Python Tutorial](https://www.w3schools.com/python/)
* [pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html)

