# Pandas

Pandas is a Python module for working with tabular data (i.e., data in a table with rows and columns). Pandas offers much of the same functionality as SQL or Excel, combined with the power of Python.

A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.

Each column has a name, which is usually a string. Each row has an index label, which by default is an integer starting at 0. DataFrames can contain many different data types: strings, ints, floats, tuples, etc.

## Creating a DataFrame

We can pass in a dictionary to pd.DataFrame(). Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error. Here’s an example:

import pandas as pd
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51]
})

![f](https://i.imgur.com/4hayqxo.jpg)
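
To illustrate the same-length rule, here is a minimal sketch that deliberately gives one column fewer values than the other. The data is made up for this example; with dictionary input like this, pandas raises a ValueError, which we catch so the script keeps running.

```python
import pandas as pd

# Hypothetical data: 'name' has three values but 'age' has only two.
try:
    pd.DataFrame({
        'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
        'age': [34, 28],
    })
    error_message = None
except ValueError as err:
    # pandas refuses to build a DataFrame from columns of unequal length.
    error_message = str(err)

print(error_message)
```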

We can also add data using lists.

For example, we can pass in a list of lists, where each one represents a row of data. Use the keyword argument columns to pass a list of column names.

df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '789 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])

![p](https://i.imgur.com/zzIr5sh.jpg)
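
As a quick check, the two construction styles should produce the same DataFrame. The sketch below rebuilds both from the data shown above and compares them with .equals().

```python
import pandas as pd

# Column-oriented: each dictionary key is a column name.
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51],
})

# Row-oriented: each inner list is one row, and columns= names the columns.
df2 = pd.DataFrame(
    [
        ['John Smith', '123 Main St.', 34],
        ['Jane Doe', '456 Maple Ave.', 28],
        ['Joe Schmo', '789 Broadway', 51],
    ],
    columns=['name', 'address', 'age'],
)

# .equals() compares shape, column labels, and values.
same = df1.equals(df2)
print(same)
```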

## CSV Files

CSV (comma-separated values) is a text-only spreadsheet format. We can find CSVs in lots of places:

1. Online datasets 
2. Export from Excel or Google Sheets
3. Export from SQL

The first row of a CSV usually contains the column headings. All subsequent rows contain values. Each column heading and each value is separated by a comma:

![o](https://i.imgur.com/oYYYl34.jpg)

### Loading and Saving CSVs

When we have data in a CSV, we can load it into a DataFrame in Pandas using .read_csv():

pd.read_csv('my-csv-file.csv')

In the example above, the function pd.read_csv() is called, and the name of the CSV file (my-csv-file.csv) is passed in as an argument. It returns a new DataFrame, so we usually assign the result to a variable.

We can also save data to a CSV, using .to_csv().

df.to_csv('new-csv-file.csv')

In the example above, the .to_csv() method is called on df (which represents a DataFrame object). The name of the CSV file is passed in as an argument (new-csv-file.csv). By default, this method will save the CSV file in your current directory.
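
Here is a minimal round-trip sketch: save a small DataFrame to a CSV and load it back. The data and the temporary directory are assumptions made for this example, so that it leaves no files behind. The index=False argument is optional; without it, .to_csv() also writes the row index as an extra first column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data for the example.
df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe'],
    'age': [34, 28],
})

# Work inside a temporary directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'new-csv-file.csv')
    # index=False keeps the row index out of the file.
    df.to_csv(path, index=False)
    # read_csv returns a new DataFrame built from the file contents.
    loaded = pd.read_csv(path)

print(loaded)
```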

### Inspecting a DataFrame

When we load a new DataFrame from a CSV, we want to know what it looks like.

If it’s a small DataFrame, you can display it by typing print(df).

If it’s a larger DataFrame, it’s helpful to be able to inspect a few items without having to look at the entire DataFrame.

The method .head() gives the first 5 rows of a DataFrame. If you want to see more rows, you can pass in the positional argument n. For example, df.head(10) would show the first 10 rows.

The method df.info() prints a summary of the DataFrame: the number of rows, plus the name, non-null count, and data type of each column.
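
A short sketch of these inspection methods, using a made-up 20-row DataFrame (the column names order_id and price are hypothetical):

```python
import pandas as pd

# Hypothetical 20-row DataFrame.
df = pd.DataFrame({
    'order_id': list(range(1, 21)),
    'price': [9.99] * 20,
})

first_five = df.head()    # first 5 rows by default
first_ten = df.head(10)   # first 10 rows

print(first_five)

# info() prints its summary directly instead of returning it.
df.info()
```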

### Selecting Columns

Now we know how to create and load data. Let’s select parts of those datasets that are interesting or important to our analyses.

![p](https://i.imgur.com/JypozZe.jpg)

There are two possible syntaxes for selecting all values from a column:

1. Select the column as if you were selecting a value from a dictionary using a key. In our example, we would type customers['age'] to select the ages.
2. If the name of a column follows all of the rules for a variable name (doesn’t start with a number, doesn’t contain spaces or special characters, etc.), then you can select it using the following notation: df.MySecondColumn. In our example, we would type customers.age.

When we select a single column, the result is called a Series.

![i](https://i.imgur.com/Ir47LXX.jpg)
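
Both syntaxes can be sketched as follows, assuming a small customers DataFrame with made-up values:

```python
import pandas as pd

# Hypothetical customers table.
customers = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'age': [34, 28, 51],
})

ages_by_key = customers['age']   # dictionary-style selection
ages_by_attr = customers.age     # attribute-style selection

# Either way, a single column comes back as a Series.
print(type(ages_by_key))
```

The bracket form works for any column name, so it is the safer choice when a name contains spaces or matches an existing DataFrame method.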

## Selecting Multiple Columns

When we have a larger DataFrame, we might want to select just a few columns.

To select two or more columns from a DataFrame, we use a list of the column names.

![p](https://i.imgur.com/DzRMD3A.jpg)
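
As a minimal sketch with a made-up customers DataFrame, note the double brackets: the outer pair selects from the DataFrame and the inner pair is the list of column names.

```python
import pandas as pd

# Hypothetical customers table.
customers = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51],
})

# Passing a list of names returns a DataFrame with just those columns.
name_and_age = customers[['name', 'age']]
print(name_and_age)
```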

## Selecting Rows

![p](https://i.imgur.com/pTH2vQV.jpg)

Maybe our Customer Service department has just received a message from Joyce Waller, so we want to know exactly what she ordered. We want to select this single row of data.

DataFrames are zero-indexed, meaning that we start with the 0th row and count up from there. Joyce Waller’s order is at index 2, which is the third row from the top.

We select it using the following command:

orders.iloc[2]

![p](https://i.imgur.com/KSLsRnN.jpg)
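
A minimal sketch of single-row selection. The orders table below is invented for illustration (only Joyce Waller’s name comes from the text above), with her order placed at position 2.

```python
import pandas as pd

# Hypothetical orders table; Joyce Waller's order sits at position 2.
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer': ['Ann Lee', 'Ben Cho', 'Joyce Waller', 'Dee Park'],
    'product': ['shoes', 'hat', 'boots', 'scarf'],
})

# iloc selects by integer position; a single row comes back as a Series
# whose labels are the column names.
row = orders.iloc[2]
print(row)
```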

## Selecting Multiple Rows

We can also select multiple rows from a DataFrame.

1. orders.iloc[3:7] would select all rows starting at the 3rd row and up to but not including the 7th row (i.e., the 3rd row, 4th row, 5th row, and 6th row)

![p](https://i.imgur.com/LZ2mVtl.jpg)

2. orders.iloc[:4] would select all rows up to, but not including the 4th row (i.e., the 0th, 1st, 2nd, and 3rd rows)

3. orders.iloc[-3:] would select the rows starting at the 3rd to last row and up to and including the final row

![p](https://i.imgur.com/nIJ6LKH.jpg)
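
The three slices above can be sketched on a made-up 10-row orders DataFrame (the order_id column is hypothetical). The slicing rules match those of Python lists: the start is included and the stop is excluded.

```python
import pandas as pd

# Hypothetical 10-row orders table, positions 0 through 9.
orders = pd.DataFrame({'order_id': list(range(100, 110))})

middle = orders.iloc[3:7]       # positions 3, 4, 5, 6
first_four = orders.iloc[:4]    # positions 0, 1, 2, 3
last_three = orders.iloc[-3:]   # positions 7, 8, 9

print(middle)
```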

## Selecting Rows with Logic

We can select a subset of a DataFrame by using logical statements:

df[df.MyColumnName == desired_column_value]

![p](https://i.imgur.com/KRcOdbU.jpg)

Suppose we want to select all rows where the customer’s age is 30. We would use:

df[df.age == 30]

In Python, == is how we test if a value is exactly equal to another value.

We can use other logical statements, such as:

1. Greater than (>). Here, we select all rows where the customer’s age is greater than 30:

df[df.age > 30]

2. Less than (<). Here, we select all rows where the customer’s age is less than 30:

df[df.age < 30]

3. Not equal (!=). This snippet selects all rows where the customer’s name is not Clara Oswald:

df[df.name != 'Clara Oswald']

![p](https://i.imgur.com/SIOgF53.jpg)
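
It may help to see what happens under the hood. A comparison such as df.age == 30 produces a Series of True/False values (often called a mask), and df[mask] keeps only the rows where the mask is True. The sketch below uses made-up data.

```python
import pandas as pd

# Hypothetical customer data.
df = pd.DataFrame({
    'name': ['Clara Oswald', 'Amy Pond', 'Rory Williams', 'Martha Jones'],
    'age': [30, 25, 34, 30],
})

# The comparison is evaluated row by row and yields a boolean Series.
mask = df.age == 30
exactly_30 = df[mask]

older_than_30 = df[df.age > 30]
not_clara = df[df['name'] != 'Clara Oswald']

print(mask)
print(exactly_30)
```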

We can also combine multiple logical statements using & (and) or | (or), as long as each statement is in parentheses.
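
A minimal sketch of combined conditions on made-up data. Pandas uses the operators & (and) and | (or) here; Python’s own and/or keywords raise an error when applied to a whole column. The parentheses are required because & and | bind more tightly than comparisons such as < and >.

```python
import pandas as pd

# Hypothetical customer data.
df = pd.DataFrame({
    'name': ['Clara Oswald', 'Amy Pond', 'Rory Williams', 'Martha Jones'],
    'age': [30, 25, 34, 55],
})

# OR: customers younger than 30 or older than 50.
young_or_old = df[(df.age < 30) | (df.age > 50)]

# AND: customers aged 30 or more who are not Clara Oswald.
thirty_plus_not_clara = df[(df.age >= 30) & (df['name'] != 'Clara Oswald')]

print(young_or_old)
print(thirty_plus_not_clara)
```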

![p](https://i.imgur.com/OQBayAK.jpg)

![s](https://i.imgur.com/IRrnwzW.jpg)

![p](https://i.imgur.com/8Ve1IDU.jpg)

![p](https://i.imgur.com/yXw3yR6.jpg)

## Setting Indices

![p](https://i.imgur.com/9X2RJaK.jpg)

![p](https://i.imgur.com/h4ieuNA.jpg)

![p](https://i.imgur.com/xSNAZPv.jpg)

## Review

![i](https://i.imgur.com/iSQOQUz.jpg)

![p](https://i.imgur.com/Oar7AI1.jpg)

## Quiz Questions and Solutions
  
![p](https://i.imgur.com/h7XM00n.jpg)

![p](https://i.imgur.com/dMSWjos.jpg)

![p](https://i.imgur.com/Xk5ge58.jpg)

![p](https://i.imgur.com/SKmRbKq.jpg)

![p](https://i.imgur.com/R0Umw94.jpg)

![p](https://i.imgur.com/zPP2EDT.jpg)

![p](https://i.imgur.com/0KwiJbI.jpg)

![p](https://i.imgur.com/gyHV64w.jpg)

![p](https://i.imgur.com/WXFAxcP.jpg)

# Modifying DataFrames

## Adding a Column

Sometimes, we want to add a column to an existing DataFrame. We might want to add new information or perform a calculation based on the data that we already have.

One way that we can add a new column is by giving a list of the same length as the existing DataFrame.

![p](https://i.imgur.com/TYFJXoS.jpg)
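
A minimal sketch of adding a column from a list, using a made-up product table (all names and values are hypothetical). The list needs one value per existing row.

```python
import pandas as pd

# Hypothetical product table with three rows.
df = pd.DataFrame({
    'product': ['hammer', 'nails', 'saw'],
    'price': [5, 10, 3],
})

# Assigning a list to a new column name creates the column.
df['quantity'] = [10, 5, 8]
print(df)
```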

We can also add a new column that is the same for all rows in the DataFrame. 

![p](https://i.imgur.com/UpFipyV.jpg)
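
As a sketch with made-up data, assigning a single value fills every row of the new column with that value:

```python
import pandas as pd

# Hypothetical product table.
df = pd.DataFrame({
    'product': ['hammer', 'nails', 'saw'],
    'price': [5, 10, 3],
})

# A single value is repeated for every row.
df['in_stock'] = True
print(df)
```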

We can add a new column by performing an operation on the existing columns.

![p](https://i.imgur.com/XjgREO9.jpg)
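
A minimal sketch of a computed column, again on made-up data. Arithmetic between two columns is applied row by row, so no loop is needed.

```python
import pandas as pd

# Hypothetical product table.
df = pd.DataFrame({
    'product': ['hammer', 'nails', 'saw'],
    'price': [5, 10, 3],
    'quantity': [10, 5, 8],
})

# Each row's revenue is that row's price times that row's quantity.
df['revenue'] = df['price'] * df['quantity']
print(df)
```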