# Week 2: Git Basics, Introduction to Pandas & Python Functions

## Part 2: Introduction to Pandas & Python Functions

## Introduction to Pandas

Pandas, derived from the term "panel data," is a robust, open-source data analysis library for Python. It provides fast, flexible, and expressive data structures designed to work with both structured (tabular, multidimensional, and potentially heterogeneous) and time series data.

### Basics of Pandas: Series and DataFrames

- **Series**:
  - A one-dimensional labeled array capable of holding any data type.
  - It has an index and values.
  - Example:
    ```python
    import pandas as pd
    import numpy as np

    s = pd.Series([1, 3, 5, np.nan, 6, 8])
    print(s)
    ```

- **DataFrame**:
  - A two-dimensional labeled data structure with rows and columns.
  - Can be visualized like a spreadsheet or SQL table.
  - Columns can be of different types (numeric, string, boolean, etc.).
  - Example:
    ```python
    data = {'name': ['John', 'Anna', 'Lucas'],
            'age': [28, 22, 19],
            'city': ['New York', 'London', 'Berlin']}
    df = pd.DataFrame(data)
    print(df)
    ```
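
The two structures above can be inspected through a few attributes. The following is a minimal sketch that reuses the same example data; the `ages` Series with name labels is a hypothetical addition, included only to show that an index does not have to be integers.

```python
import numpy as np
import pandas as pd

# A Series pairs an index (the labels) with values.
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s.index)   # default integer labels, 0 through 5
print(s.values)  # the underlying NumPy array
print(s.dtype)   # float64: the NaN forces a floating-point type

# Hypothetical example: custom labels instead of the default integer index.
ages = pd.Series([28, 22, 19], index=['John', 'Anna', 'Lucas'])
print(ages['Anna'])  # look up a value by its label

# A DataFrame is a collection of columns that share one row index.
data = {'name': ['John', 'Anna', 'Lucas'],
        'age': [28, 22, 19],
        'city': ['New York', 'London', 'Berlin']}
df = pd.DataFrame(data)
print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # each column has its own data type

# Selecting a single column returns a Series.
age_column = df['age']
print(type(age_column))
```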



### Data Manipulation: Filtering, Sorting, and Grouping

- **Filtering**:
  - Use conditions to select specific rows.
  - Example to filter records where age is greater than 20:
    ```python
    filtered_data = df[df['age'] > 20]
    print(filtered_data)
    ```

  Let's break down what is happening step by step:

    1. **`df['age']`**: This part of the code selects the 'age' column from the DataFrame `df`. The result is a pandas Series containing all the values in the 'age' column.

    2. **`df['age'] > 20`**: This is a conditional operation that's applied to the 'age' Series. It will return another Series of the same length, but instead of ages, it will contain boolean values (`True` or `False`). A value will be `True` if the corresponding age is greater than 20, and `False` otherwise. For example, if `df['age']` contains `[18, 21, 19, 22, 25]`, then `df['age'] > 20` will return `[False, True, False, True, True]`.

    3. **`df[...]`**: The outer `df[...]` is used to index (or select) rows from the DataFrame `df`. When you use a boolean Series to index a DataFrame like this, pandas will select all rows that correspond to `True` values in the boolean Series. Using our earlier example, only the rows with ages 21, 22, and 25 will be selected.

    4. **`filtered_data = ...`**: The result of the above operations, which is a subset of the original DataFrame with only the rows where age is greater than 20, is then assigned to the variable `filtered_data`.

  In essence, the logic of `filtered_data = df[df['age'] > 20]` is: "From the DataFrame `df`, give me a new DataFrame (`filtered_data`) that only contains rows where the value in the 'age' column is greater than 20."

- **Sorting**:
  - Arrange data based on the values of specific columns.
  - Example for sorting by age:
    ```python
    sorted_data = df.sort_values(by='age', ascending=False)
    print(sorted_data)
    ```

- **Grouping**:
  - Aggregate data based on column values.
  - Example to group by city and get the average age:
    ```python
    grouped_data = df.groupby('city')['age'].mean()
    print(grouped_data)
    ```
  - Note the structure of this expression:
    1. **`df.groupby('city')`**: This groups the DataFrame `df` by the unique values in the 'city' column. The result is a `GroupBy` object, a special pandas object that behaves much like a DataFrame but represents a collection of groups (or segments) of your data.
    2. **`df.groupby('city')['age']`**: This selects the 'age' column from each group. You now have a `GroupBy` object focused on the 'age' values for each city.
    3. **`.mean()`**: This calculates the mean (or average) of the 'age' values for each city. The result is a Series indexed by city.
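
The filtering and grouping breakdowns above can be followed one intermediate step at a time. This is a minimal sketch rather than the only way to write it. It assumes the example DataFrame plus one hypothetical extra row ('Mia', London), added so that at least one city contains more than one person. The variable names `mask`, `groups`, and `age_groups` are illustrative.

```python
import pandas as pd

# Example data; the 'Mia' row is a hypothetical addition for illustration.
df = pd.DataFrame({
    'name': ['John', 'Anna', 'Lucas', 'Mia'],
    'age': [28, 22, 19, 30],
    'city': ['New York', 'London', 'Berlin', 'London'],
})

# --- Filtering, one step at a time ---
ages = df['age']          # step 1: select the column (a Series)
mask = ages > 20          # step 2: boolean Series, True where age > 20
filtered_data = df[mask]  # step 3: keep only the rows where mask is True
print(mask.tolist())      # [True, True, False, True]
print(filtered_data)

# --- Grouping, one step at a time ---
groups = df.groupby('city')       # step 1: split the rows by city
age_groups = groups['age']        # step 2: focus on the 'age' column
grouped_data = age_groups.mean()  # step 3: one mean per city
print(grouped_data)
```

Writing the steps separately gives the same result as the one-line versions and can make it easier to inspect each intermediate object while learning.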

  

## Exercise 2: Creating a dataframe

Two steps before you start:

1. Assuming that you have cloned this repository to your Google Drive, go to Google Drive, navigate to the folder `MyDrive/git_projects/AE_772_892/`, and open the notebook `Lectures/892_Lecture_2_part_2`.
2. Then:
   - Connect your Google Drive folder to the notebook.
   - Set your working directory.
   - Stash any changes you have made to the repository.
   - Pull any changes that have been made to the repository.


```python
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/git_projects/AE_772_892
!git stash
!git pull
```

Now do the following in the code block below:



- Create a dataframe with the following columns for the entire class and call it `df_class` (if you are the only person doing this you can make up the data):
  - Name
  - Surname
  - Age
  - Home Province
  - Favorite Color
  - Favorite Food
  - Course target mark
- What is the average age of the class?
- What is the average age of the class by province?
- Filter your dataframe by age and show only students older than 22
- Sort your dataframe by target mark in descending order
- Group your dataframe by home province and show the average age of students in each home province

### Reading and Writing Data Using Pandas

- **Reading Data**:
  - Pandas can read multiple file formats including CSV, Excel, SQL databases, and more.
  - Example for reading a CSV file:
    ```python
    data = pd.read_csv('filename.csv')
    print(data.head())  # Display first 5 rows
    ```
  - Note that you must set your working directory correctly, or give the full or relative file path, if the file is not in the same directory as your notebook.
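
The point about file paths can be tried without any existing data file. The sketch below first writes a tiny CSV to a temporary folder and then reads it back by its full path. The file name `example.csv` and its contents are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a small, made-up CSV file in a temporary folder.
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / 'example.csv'
csv_path.write_text('name,age\nJohn,28\nAnna,22\n')

# Passing the full path works regardless of the working directory.
data = pd.read_csv(csv_path)
print(data.head())

# A bare file name such as 'example.csv' would be looked up
# relative to the current working directory, shown here:
print(Path.cwd())
```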
  

## Exercise 3: Importing a dataframe

Import the possum dataset as a dataframe and call it `df_pos`. Let's answer the following questions together:
- What does the data look like? Print the first five rows of the dataframe: `print(df_pos.head(5))`
- What are the column names? `print(df_pos.columns)`
- What are the data types of the columns? `print(df_pos.dtypes)`
- What are the dimensions (rows, columns) of the dataframe? `print(df_pos.shape)`
- Describe the data - what are the summary statistics of the dataframe? `print(df_pos.describe())`
- What is the average age of the possums? `print(df_pos['age'].mean())`


- **Writing Data**:
  - DataFrames can be saved to a variety of file formats.
  - Example for writing to an Excel file:
    ```python
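    # Writing .xlsx files requires the openpyxl package to be installed.
    # index=False leaves the row index out of the saved file.
    df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)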
    ```

- **File Formats**:
  - Pandas supports a variety of file formats including:
    - Text formats such as CSV, JSON, and HTML.
    - Binary formats such as Excel, HDF5, and Parquet.
    - SQL databases like SQLite, PostgreSQL, and MySQL.
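
As a rough illustration of moving between formats, the sketch below writes the example DataFrame to CSV and JSON and reads both files back. These two text formats were chosen because they need no extra packages. The output file names and the temporary folder are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({'name': ['John', 'Anna', 'Lucas'],
                   'age': [28, 22, 19],
                   'city': ['New York', 'London', 'Berlin']})

# Write to a temporary folder so no existing files are touched.
out_dir = Path(tempfile.mkdtemp())
df.to_csv(out_dir / 'output.csv', index=False)         # text: CSV
df.to_json(out_dir / 'output.json', orient='records')  # text: JSON

# Read both files back into new DataFrames.
csv_back = pd.read_csv(out_dir / 'output.csv')
json_back = pd.read_json(out_dir / 'output.json', orient='records')

print(csv_back)
print(json_back)
```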