# Python and Data Analysis 1 - Introduction to Pandas DataFrames

**Goal:** The goal of this project is to become comfortable working with data in Pandas.

**Description:** This project will cover the basics of Pandas: importing, manipulating, and filtering data in Pandas DataFrames. Becoming fluent takes practice, but is necessary when building larger data analysis projects. After this project, you should feel comfortable navigating any dataset using Pandas.

## 1A: Data Basics

### Types of Data

At the most fundamental level, there are three kinds of data:
 - *Qualitative/Categorical*: Country, Industry, Faculty...
 - *Quantitative/Numerical*: Height, Rank, Price...
 - *Identifying*: Stock Ticker, Card Number, Product ID
 
Pandas DataFrames can store different data types, but the basic ones we will focus on here are:
 - *Strings* (`object`): Text
 - *Integers* (`int64`): Numbers without decimals
 - *Floats* (`float64`): Numbers with decimals
 - *Booleans* (`bool`): Values that are either True or False
 - *DateTimes* (`datetime64`): Values that store a specific date and time
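
As a rough illustration of the list above, the sketch below builds a tiny DataFrame with one column per basic type and prints what Pandas reports for each. The column names and values are invented for this example and are not part of the stock dataset. Depending on the installed Pandas version, the text column may be reported as `object` or as a dedicated string type.

```python
import pandas as pd

# Hypothetical sample data: one column for each basic type
sample = pd.DataFrame({
    "ticker": ["AAA", "BBB"],                                  # text
    "shares": [10, 25],                                        # integers
    "price": [12.5, 30.25],                                    # floats
    "is_tech": [True, False],                                  # booleans
    "listed": pd.to_datetime(["2001-01-15", "2010-06-30"]),    # datetimes
})

# One dtype per column
print(sample.dtypes)
```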

### Importing and Displaying Data

We can import data from CSV (comma-separated values) files with `read_csv`. The following stock dataset is sourced from https://www.macrotrends.net/.

In [None]:
import pandas as pd
df = pd.read_csv('MSFT.csv') # Import the CSV
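
If `MSFT.csv` is not at hand, `read_csv` can also be tried on CSV text held in memory, since it accepts file-like objects as well as file paths. The sketch below assumes the same column names as the stock dataset, but the two rows are made up. The `parse_dates` argument is optional and asks Pandas to store the `date` column as datetimes rather than text.

```python
import io
import pandas as pd

# Made-up rows using the same column names as the stock dataset
csv_text = """date,open,high,low,close,volume
2020-01-02,100.0,102.0,99.0,101.0,1000000
2020-01-03,101.0,103.0,100.0,102.5,1200000
"""

# io.StringIO wraps the string so read_csv can treat it like a file
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(prices)
print(prices.dtypes)
```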

We can then print the entire DataFrame.

In [None]:
print(df)

Print the first five rows with `head`.

In [None]:
print(df.head())

Print the last five rows with `tail`.

In [None]:
print(df.tail())

Use `dtype` to determine the type of data.

In [None]:
df['close'].dtype # Find the type of data in the 'close' column

**Challenge**: Guess what the data type of the 'volume' column is, then write code to display it.

In [None]:
df['volume'].dtype

### DataFrame Elements

DataFrames have three key elements:
 - *Data*: The content of the DataFrame
 - *Columns*: Horizontal headings (`date`, `open`, `high`, `low`, `close`, `volume`), each uniquely identifying a column
 - *Index*: Vertical values, each uniquely identifying a row (the numbers on the leftmost side)
 
 
Note: The *position* of a column or index is the order in which it appears, starting from 0. The column positions are numbered 0, 1, 2, 3, 4, 5, in order of appearance. In this DataFrame, the index is the same as the index position for all rows (index at position 0 has value 0).
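
To make the difference between a value and its position concrete, here is a small sketch on a toy DataFrame whose index values are letters rather than numbers. The DataFrame `toy` and its contents are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {"open": [1.0, 2.0, 3.0], "close": [1.5, 2.5, 3.5]},
    index=["a", "b", "c"],  # hypothetical non-integer index values
)

print(list(toy.columns))  # column headings
print(list(toy.index))    # index values

# get_loc translates a heading or index value into its position
close_position = toy.columns.get_loc("close")
c_position = toy.index.get_loc("c")
print(close_position, c_position)
```

Here the index value `'c'` sits at position 2, so value and position no longer coincide the way they do in the stock DataFrame.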

## 1B: Slicing and Dicing Data

When analyzing data, we need to be able to manipulate DataFrames to select the exact subset of data that we want. Here, we will cover a few key techniques.

### Column Select

If we want to get data from a particular column, we can use the syntax `df-name['column-name']`. This returns a Pandas *Series*. For example, if we want to get the `date` column from the DataFrame called `df`, we can do:

In [None]:
dates = df['date'] # Select the 'date' column as a Series
print(dates)

To select multiple columns, we use `df-name[['col1', 'col2', ...]]`. Instead of passing a single column, we pass a list of columns. Instead of returning a Series, Pandas will return a DataFrame. This also allows us to rearrange the order of columns.

In [None]:
multiple_cols = df[['volume', 'close', 'date']]
print(multiple_cols)
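
The Series versus DataFrame distinction can be checked directly. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical name) and shows that a single column name gives a Series, while a list gives a DataFrame, even when the list holds only one name.

```python
import pandas as pd

toy = pd.DataFrame({"high": [3.0, 4.0], "low": [1.0, 2.0], "volume": [100, 200]})

one_col = toy["high"]                 # single name -> Series
one_col_df = toy[["high"]]            # list with one name -> DataFrame
reordered = toy[["volume", "high"]]   # list order sets the column order

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(list(reordered.columns))
```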

**Challenge**: Print out the 'high' and 'low' columns.

In [None]:
print(df[['high', 'low']])

### Row Select

Pandas provides two ways to select particular rows from a DataFrame.

#### Row Select by Value

We use the row's index value to select the row, following the syntax `df_name.loc[index_value]`. This will return a Series. Here, the index value is the same as its position (the index value at position 0 is 0), but this is not always the case. The index value can be a non-integer, but its position will always be an integer from 0 to the number of rows minus one.

In [None]:
first_row = df.loc[0]
print(first_row)
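
As a sketch of a case where index value and position differ, the toy DataFrame below (hypothetical data) has integer index values that are out of order, so asking for value 0 and asking for position 0 return different rows.

```python
import pandas as pd

# Index values 2, 0, 1 sit at positions 0, 1, 2
toy = pd.DataFrame({"close": [10.0, 11.0, 12.0]}, index=[2, 0, 1])

by_value = toy.loc[0, "close"]   # row whose index VALUE is 0
by_position = toy.iloc[0, 0]     # row at POSITION 0, first column

print(by_value)
print(by_position)
```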

For multiple rows, we can pass a list of index values with `df_name.loc[[values_list]]` or a range with `df_name.loc[start_value:end_value]`. Unlike ordinary Python slices, a `loc` range includes `end_value`.

In [None]:
first_five_rows = df.loc[[0,1,2,3,4]] # Access the rows with index 0,1,2,3,4
print(first_five_rows)

first_five_rows_range = df.loc[0:4] # Access the rows starting with index 0 up to and including index 4
print(first_five_rows_range)

**Challenge**: Print out rows with index values 1000 to 2000, then print out rows with index values 1000, 1500, and 2000.

In [None]:
rows_1000_to_2000 = df.loc[1000:2000]
# print(rows_1000_to_2000)

rows = df.loc[[1000, 1500, 2000]]
print(rows)

# print(df.loc[1000:2000])
# print(df.loc[[1000, 1500, 2000]])

#### Row Select by Position

Instead of using the value of the index, we can also use its position with `iloc`. This is useful if our index values are not sequential integers. If our index was made of strings, but we wanted to select the first 20 rows of data, `loc` would not work unless we knew the first 20 values in our index. Apart from taking positions instead of values, `iloc` is quite similar to `loc`, with one difference: like ordinary Python slices, an `iloc` range stops just before `end_position`. We use the syntax `df_name.iloc[index_position]` for a single row and `df_name.iloc[[positions_list]]` or `df_name.iloc[start_position:end_position]` for multiple rows.

In [None]:
row_at_position_4 = df.iloc[4] # Access row with position 4
# print(row_at_position_4)

last_five_rows = df.iloc[[-5,-4,-3,-2,-1]] # Access rows with positions -5,-4,-3,-2,-1
# print(last_five_rows)

last_five_rows_range = df.iloc[-5:] # Access rows with positions -5 to the end
print(last_five_rows_range)
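
The sketch below puts `loc` and `iloc` side by side on a toy DataFrame with a string index (the data is hypothetical). It also shows how the two treat the end of a range differently: a `loc` range includes the end value, while an `iloc` range stops before the end position.

```python
import pandas as pd

toy = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0]}, index=["w", "x", "y", "z"])

by_value = toy.loc["w":"y"]    # value range: includes the row with index 'y'
by_position = toy.iloc[0:2]    # position range: stops before position 2

print(by_value)
print(by_position)
```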

**Challenge**: Select the last 2000 rows in the dataframe **except for the last row**. Hint: the last row always has position -1.

In [None]:
print(df.iloc[-2000:-1]) # Positions -2000 up to, but not including, -1 (the last row)

### Subsection Select

To select a subsection of rows and columns at the same time, we can use either `df_name.loc[row_values, column_values]` or `df_name.iloc[row_positions, column_positions]`. The key difference is `loc` takes the value itself, whereas `iloc` takes the position.

In [None]:
first_5_close_and_vol = df.loc[[0,1,2,3,4], ['close', 'volume']] # Access the 'close' and 'volume' columns for the rows with index values 0,1,2,3,4
# print(first_5_close_and_vol)

last_2_date_close_and_volume = df.iloc[-2:, [0,4,5]] # Access the 'date', 'close', and 'volume' columns for the rows from position -2 to the end
print(last_2_date_close_and_volume)
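
When the index values match their positions, the same subsection can be reached either way. The sketch below, on a small made-up DataFrame, selects the last two rows of `close` and `volume` once by value and once by position.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["d1", "d2", "d3"],      # placeholder dates
    "close": [10.0, 11.0, 12.0],
    "volume": [100, 200, 300],
})

by_value = toy.loc[[1, 2], ["close", "volume"]]   # index values and column names
by_position = toy.iloc[-2:, [1, 2]]               # row and column positions

print(by_value)
print(by_position)
```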

**Challenge**: Select and display the `high` and `low` prices for the first row, last row, and row 5000 using both `loc` and `iloc`. When you use `iloc`, try to ensure it accesses the last row regardless of the number of rows in the dataframe.

In [None]:
# print(df.loc[[0, 5000, 8633], ['high', 'low']])
# print(df.iloc[[0, 5000, -1], [2, 3]])

rows = df.loc[[0, 5000, len(df) - 1], ["high", "low"]]
print(rows)

rows_2 = df.iloc[[0, 5000, -1], [2, 3]]
print(rows_2)

## 1C: Filtering Data

Perhaps more interesting than manually selecting data, *filtering* allows us to access a subset of rows based on a condition or set of conditions. For example, if we were only interested in rows where the high price was greater than 1, we would filter out all other rows. To do this, we use the syntax `df_name[condition]`.

In [None]:
filtered = df[df['high'] > 1] # Select only the rows where high > 1
print(filtered)
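
It can help to look at the condition on its own. A comparison such as `df['high'] > 1` produces a Series of True/False values, one per row, and passing that Series inside the square brackets keeps only the True rows. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"high": [0.5, 1.2, 3.0]})

mask = toy["high"] > 1     # boolean Series, one value per row
print(mask.tolist())

filtered_toy = toy[mask]   # keeps only the rows where mask is True
print(filtered_toy)
```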

We can combine more than one condition using the bitwise operators `&` (and) and `|` (or). Each condition must be wrapped in parentheses, because `&` and `|` bind more tightly than comparisons such as `>`. If we wanted the rows where `volume` was larger than 35 million and `close` was greater than 184, we could combine the two conditions with `&`.

In [None]:
filtered = df[(df['volume'] > 35000000) & (df['close'] > 184)]
print(filtered)

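Notice that the number of rows decreases as we combine more conditions with `&`. This makes sense, because each added condition is one more test that a row must pass, so more rows are filtered out, resulting in a smaller DataFrame. Combining conditions with `|` has the opposite effect: a row only needs to meet one of the conditions, so the result can only stay the same size or grow.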
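
To see how the choice of operator changes the row count, the sketch below applies the same two conditions to a four-row toy DataFrame (hypothetical values), once with `&` and once with `|`.

```python
import pandas as pd

toy = pd.DataFrame({
    "close": [180.0, 185.0, 190.0, 170.0],
    "volume": [40_000_000, 30_000_000, 50_000_000, 20_000_000],
})

high_volume = toy["volume"] > 35_000_000
high_close = toy["close"] > 184

both = toy[high_volume & high_close]     # rows meeting both conditions
either = toy[high_volume | high_close]   # rows meeting at least one condition

print(len(both), len(either))
```

Storing each condition in a named variable first is optional, but it avoids the need for extra parentheses and keeps long filters readable.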

**Challenge**: Try selecting and displaying only the rows where the difference between the open and close price is greater than 5. Note: you can get the difference between two columns by subtracting them.

In [None]:
result = df[(df["open"] - df["close"]) > 5]
print(result)
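
One detail worth checking is the sign of the difference. Subtracting `close` from `open` is positive only when the price fell during the day, so a filter like the one above keeps falls but not rises. If moves in either direction are wanted, one option is to take the absolute value with `abs()`. A small sketch on made-up prices:

```python
import pandas as pd

toy = pd.DataFrame({
    "open": [100.0, 100.0, 100.0],
    "close": [93.0, 108.0, 99.0],
})

change = toy["open"] - toy["close"]   # positive when the price fell

big_falls = toy[change > 5]           # only falls of more than 5
big_moves = toy[change.abs() > 5]     # moves of more than 5 in either direction

print(change.tolist())
print(len(big_falls), len(big_moves))
```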