# Lecture 2: Git Fundamentals, Intro to Pandas & Python Functions

## Part 1: Intro to Git & Its Application in Google Colab

### Why Git? 
Git is a version control system that:
- Helps track changes in code.
- Facilitates collaboration.
- Is essential for modern software development & data science.

### Brief History:
- **Created by**: Linus Torvalds.
- **Year**: 2005.
- **Purpose**: To have a distributed version control system for large projects without a central server dependency.

### Git Workflow:
- **Local Working Directory**: Directly work on files here.
- **Staging Area**: Prepare and review commits.
- **Local Repository**: Store commits on your machine.
- **Remote Repository**: Hosted project version online or on a network.

<img src="https://drive.google.com/uc?id=12Sz7YdjFoVmY49HJ4DzrzUkCwqsYxP0v" alt="Alt text" width="750"/>

### Key Commands:
- `git clone`: Copy a repository.
- `git stash`: Temporarily set aside uncommitted changes.
- `git pull`: Fetch updates from the remote repository and merge them into your local copy.

### Using Git in Google Colab:
Google Colab runs a line as a terminal (`bash`) command instead of Python code when the line starts with a `!` prefix. Example:

```python
!git clone <repository_url>
```

## Exercise 1: Cloning your class repository

IMPORTANT: If you get stuck - read this [page](https://medium.com/analytics-vidhya/how-to-use-google-colab-with-github-via-google-drive-68efb23a42d) and follow the instructions.

From now on I'm not going to share individual notebooks with you but rather the class repository. You will need to clone the repository to your Google Drive and then open the notebooks from there. To do so, follow the steps below.

1. Open your Google Drive page and create a new folder called "git_projects" (you must call it exactly this) in your `My Drive` folder.
2. Go to Google Colab and create a new notebook and call it `Project_Setup.ipynb`.
3. Mount your Google Drive by running the following code in the first cell of your notebook:
   ```python
   from google.colab import drive
   drive.mount('/content/drive')
   ```
4. Set your working directory to the folder you created in step 1 by running the following code in the second cell of your notebook. Your working directory is the folder where you want to store your code; you can think of it as your "workspace".
   ```python
   %cd /content/drive/MyDrive/git_projects/
   ```
5. Clone the class repository by running the following code in the third cell of your notebook:
   ```python
   !git clone https://github.com/jancgreyling/AE_772_892.git
   ```
   - This step will clone the repository and create a folder called `AE_772_892` in your `git_projects` folder.
6. Open the `AE_772_892` folder in your Google Drive and navigate to the `Lectures` folder and open the notebook for this lecture. You can open it in Google Colab by right-clicking on the notebook and selecting `Open with` and then `Google Colaboratory`.

**Note the following:**
- You only need to do this once. From now on you can open the notebooks directly from your Google Drive.
- Once you've cloned the repository, you cannot clone it into the same folder again. If you want to update it with the latest changes, pull the latest version of the repo from your notebook:
   ```python
   !git pull
   ```
- YOU NEED TO DO THIS AT THE START OF EACH LECTURE TO GET THE LATEST NOTEBOOKS.
- If you've made changes to the notebooks, you may get an error when you try to pull the latest changes. You will need to commit your changes first. We will discuss this in more detail in the next lecture. For now you can stash your changes (Git sets your uncommitted changes aside and restores the files to their last committed state) by running the following code:
      ```python
      !git stash
      ```


- VERY IMPORTANT:
   - You can only run the code above if you are in the correct directory. If you are not in the correct directory, you will get an error. To change your directory, you can run the following code:
      ```python
      %cd /content/drive/MyDrive/git_projects/AE_772_892/
      ```
   - When a repo is cloned or initialised, it contains a hidden `.git` folder (and possibly other hidden Git files). If you see this folder when you open the repo in Google Drive, DO NOT DELETE IT. If you do, you will break the link between the copy in your Google Drive and the GitHub repository.
   - For the same reason, you can only `stash` and `pull` from inside the repository folder shown above, because that is where Git looks for the hidden `.git` folder.
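
Before running `!git pull` or `!git stash`, it can help to confirm that the notebook is actually inside the cloned repository. The sketch below is one way to do that in plain Python. The helper name `is_git_repo` is made up for illustration, and it assumes that a cloned repository can be recognised by a `.git` folder at its top level.

```python
import os

def is_git_repo(path="."):
    """Return True if `path` contains a hidden `.git` folder (hypothetical helper)."""
    return os.path.isdir(os.path.join(path, ".git"))

# Show where the notebook currently is and whether it looks like a Git repo.
print("Working directory:", os.getcwd())
print("Looks like a Git repository:", is_git_repo())
```

If the second line prints `False`, run the `%cd` command above first and then try again.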

# Week 2 Part II: Introduction to Pandas

Pandas, derived from the term "panel data," is a robust, open-source data analysis library for Python. It provides fast, flexible, and expressive data structures designed to work with both structured (tabular, multidimensional, and potentially heterogeneous) and time series data.

## Basics of Pandas: Series and DataFrames

- **Series**:
  - A one-dimensional labeled array capable of holding any data type.
  - It has an index and values.
  - Example:
    ```python
    import pandas as pd
    import numpy as np

    s = pd.Series([1, 3, 5, np.nan, 6, 8])
    print(s)
    ```

- **DataFrame**:
  - A two-dimensional labeled data structure with rows and columns.
  - Can be visualized like a spreadsheet or SQL table.
  - Columns can be of different types (numeric, string, boolean, etc.).
  - Example:

    ```python
    import pandas as pd

    data = {
        'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
        'age': [25, 30, 35, 40, 45, 30],
        'city': ['New York', 'London', 'Berlin', 'New York', 'London', 'Berlin'],
        'salary': [50000, 55000, 60000, 65000, 70000, 55000]}

    df = pd.DataFrame(data)
    print(df)
    ```

    ```
          name  age      city  salary
    0    Alice   25  New York   50000
    1      Bob   30    London   55000
    2  Charlie   35    Berlin   60000
    3    David   40  New York   65000
    4      Eve   45    London   70000
    5    Frank   30    Berlin   55000
    ```
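
To make the "index and values" idea concrete, the short sketch below inspects the parts of a Series and the column types of a DataFrame. The variable names (`ages`, `people`) and the small example data are assumptions chosen for illustration.

```python
import pandas as pd

# A Series with custom labels instead of the default 0, 1, 2, ... index
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'])
print(ages.index.tolist())   # the labels
print(ages.values.tolist())  # the underlying values
print(ages['Bob'])           # look up a value by its label

# A DataFrame can mix column types
people = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'employed': [True, False],
})
print(people.dtypes)         # one data type per column
print(people.shape)          # (number of rows, number of columns)
```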


## Data Manipulation: Filtering using logical statements

See Lecture 1 for a recap of logical statements; this section shows how they are implemented in pandas.


Use logical statements on a DataFrame to select specific rows.
  - Example to filter records where age is greater than 30:

    ```python
    filtered_data = df[df['age'] > 30]
    print(filtered_data)
    ```

    ```
          name  age      city  salary
    2  Charlie   35    Berlin   60000
    3    David   40  New York   65000
    4      Eve   45    London   70000
    ```


Let's break down what is happening step by step:

  1. **`df['age']`**: This part of the code selects the 'age' column from the DataFrame `df`. The result is a pandas Series containing all the values in the 'age' column.

  2. **`df['age'] > 30`**: This is a conditional operation applied to the 'age' Series. It returns another Series of the same length, but instead of ages it contains boolean values (`True` or `False`). A value is `True` if the corresponding age is greater than 30, and `False` otherwise. For our data, `df['age']` contains `[25, 30, 35, 40, 45, 30]`, so `df['age'] > 30` returns `[False, False, True, True, True, False]`.

  3. **`df[...]`**: The outer `df[...]` is used to index (or select) rows from the DataFrame `df`. When you use a boolean Series to index a DataFrame like this, pandas selects all rows that correspond to `True` values in the boolean Series. In our example, only the rows with ages 35, 40, and 45 are selected.

  4. **`filtered_data = ...`**: The result of the above operations, a subset of the original DataFrame with only the rows where age is greater than 30, is then assigned to the variable `filtered_data`.

In essence, the logic of `filtered_data = df[df['age'] > 30]` is: "From the DataFrame `df`, give me a new DataFrame (`filtered_data`) that only contains rows where the value in the 'age' column is greater than 30."
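
The same steps can be run one at a time by storing the boolean Series in its own variable. This is a small sketch: `mask` is just an illustrative name, and the DataFrame is a cut-down copy of the one used above.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'age': [25, 30, 35, 40, 45, 30],
})

mask = df['age'] > 30        # a boolean Series, one value per row
print(mask.tolist())         # [False, False, True, True, True, False]

filtered_data = df[mask]     # keep only the rows where mask is True
print(filtered_data['name'].tolist())  # ['Charlie', 'David', 'Eve']
```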


## Data Manipulation: Filtering using `loc` - location based indexing

### Basics:

The `loc` indexer is primarily label-based. It is used to access a group of rows and columns by labels or by a boolean array.

### Syntax:
```python
dataframe.loc[row_indexer, column_indexer]
```

```python
print(df)
```

```
      name  age      city  salary
0    Alice   25  New York   50000
1      Bob   30    London   55000
2  Charlie   35    Berlin   60000
3    David   40  New York   65000
4      Eve   45    London   70000
5    Frank   30    Berlin   55000
```



### Examples:

1. **Selecting a Single Row by Index**:

    Assuming indices are default integers:

    ```python
    row_0_series = df.loc[0]
    row_0_df = df.loc[[0]]

    print(row_0_series)
    print("")
    print(row_0_df)
    ```

    ```
    name         Alice
    age             25
    city      New York
    salary       50000
    Name: 0, dtype: object

        name  age      city  salary
    0  Alice   25  New York   50000
    ```


2. **Selecting Rows by Range of Index**:

    Get rows 1 through 3:

    ```python
    rows_1_to_3 = df.loc[1:3]
    print(rows_1_to_3)
    ```

    ```
          name  age      city  salary
    1      Bob   30    London   55000
    2  Charlie   35    Berlin   60000
    3    David   40  New York   65000
    ```


3. **Selecting Specific Columns for Specific Rows**:

    For rows 1 through 3, get columns 'name' and 'age':

    ```python
    subset = df.loc[1:3, ['name', 'age']]
    print(subset)
    ```

    ```
          name  age
    1      Bob   30
    2  Charlie   35
    3    David   40
    ```




4. **Selecting All Rows for Specific Columns**:


    ```python
    names_and_cities = df.loc[:, ['name', 'city']]
    print(names_and_cities)
    ```

    ```
          name      city
    0    Alice  New York
    1      Bob    London
    2  Charlie    Berlin
    3    David  New York
    4      Eve    London
    5    Frank    Berlin
    ```


5. **Using Boolean Conditions**:

    Select all rows where age is above 30:

    ```python
    above_30 = df.loc[df['age'] > 30]
    print(above_30)
    ```

    ```
          name  age      city  salary
    2  Charlie   35    Berlin   60000
    3    David   40  New York   65000
    4      Eve   45    London   70000
    ```


6. **Using Multiple Conditions**:
    Select all rows where age is above 30 and city is 'London':

    ```python
    london_above_30 = df.loc[(df['age'] > 30) & (df['city'] == 'London')]
    london_above_30
    ```

    ```
       name  age    city  salary
    4   Eve   45  London   70000
    ```




7. **Setting Values for Specific Rows/Columns**:
    Set age to 40 for the person named 'Alice':

    ```python
    df.loc[df['name'] == 'Alice', 'age'] = 40
    ```


### Important Points:

- When using `loc` with a single bracket (like `df.loc[2]`), it'll return a Series representing that row. If you want it as a DataFrame, use a double bracket (like `df.loc[[2]]`).

- The end value in a range specified in `loc` (like `df.loc[1:3]`) is **inclusive**, which is different from Python's standard slicing where the end value is exclusive.

- Always ensure the values you're using within `loc` match the data type of the index. If the index is of string type, use string values, and so on.

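- Be careful when assigning values through `loc`: an assignment such as `df.loc[rows, column] = value` changes the original DataFrame in place. If you want to keep the original unchanged, work on an explicit copy made with `df.copy()`.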
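
As a minimal sketch of the last point, the example below changes a value on a copy and shows that the original is left alone. The names `df_copy` and the two-row DataFrame are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# Work on an explicit copy so the original stays untouched
df_copy = df.copy()
df_copy.loc[df_copy['name'] == 'Alice', 'age'] = 40

print(df.loc[0, 'age'])       # value in the original
print(df_copy.loc[0, 'age'])  # updated value in the copy
```

Running the same assignment on `df` directly, as in example 7, would change `df` itself.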


____

## Tutorial 2, Part 1: Creating a dataframe

First
- Create a new Jupyter notebook
- Rename it to \<your_name>\<Lecture_2_Tutorial>
- Share with me: jan5020@gmail.com

Then set up your notebook by following the steps below:

1. Mount your Google Drive by running the following cell and following the instructions:

   ```python
   from google.colab import drive
   drive.mount('/content/drive')
   ```

2. Set your working directory to the folder where you have your data:

   ```python
   %cd /content/drive/MyDrive/git_projects/AE_Python_Git
   ```

Now do the following:

- Create a dataframe using fictional data with the following columns for 5 people in a class and call it `df_class`:
  - Name
  - Surname
  - Age
  - Home Province
  - Favorite Color
  - Favorite Food
  - Course target mark
- What is the average age of the class?
- Use your dataset to illustrate the difference between `loc` and `iloc` by selecting the name of the first person in the class using both methods.

____
____

## Data Manipulation: Subsetting

### General subsetting

**Subsetting using one column**:
   Select only the 'name' column:

```python
names = df['name']
names
```

```
0      Alice
1        Bob
2    Charlie
3      David
4        Eve
5      Frank
Name: name, dtype: object
```

Note the difference: single brackets return a Series, while double brackets return a DataFrame with one column.

```python
names = df[['name']]
names
```

```
      name
0    Alice
1      Bob
2  Charlie
3    David
4      Eve
5    Frank
```
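
One quick way to check which of the two you have is to look at the type and shape of the result. The sketch below uses a small made-up DataFrame, and the names `single` and `double` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]})

single = df['name']     # single brackets
double = df[['name']]   # double brackets (a list with one column name)

print(type(single).__name__, single.shape)  # Series (3,)
print(type(double).__name__, double.shape)  # DataFrame (3, 1)
```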


**Subsetting with more than one column**:
   Select the 'name' and 'city' columns:

```python
subset = df[['name', 'city']]
subset
```

```
      name      city
0    Alice  New York
1      Bob    London
2  Charlie    Berlin
3    David  New York
4      Eve    London
5    Frank    Berlin
```


**Subsetting using text data**:
   Select rows where 'city' is 'London':

```python
london_data = df[df['city'] == 'London']
london_data
```

```
   name  age    city  salary
1   Bob   30  London   55000
4   Eve   45  London   70000
```


**Subsetting using multiple conditions**:
   Select rows where 'city' is 'London' and 'age' is greater than 30:

```python
specific_data = df[(df['city'] == 'London') & (df['age'] > 30)]
specific_data
```

```
   name  age    city  salary
4   Eve   45  London   70000
```


**Using `isin`**:
   Select rows where 'name' is either 'Alice' or 'Bob':

```python
specific_names = df[df['name'].isin(['Alice', 'Bob'])]
specific_names
```

```
    name  age      city  salary
0  Alice   40  New York   50000
1    Bob   30    London   55000
```


**Using `isin` for "is not in" condition**:

   To filter rows where the 'name' is neither 'Alice' nor 'Bob':

```python
not_specific_names = df[~df['name'].isin(['Alice', 'Bob'])]
not_specific_names
```

```
      name  age      city  salary
2  Charlie   35    Berlin   60000
3    David   40  New York   65000
4      Eve   45    London   70000
5    Frank   30    Berlin   55000
```


   The `~` operator is a bitwise negation, which inverts the boolean values returned by the `isin` method, effectively giving us the "is not in" condition.
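
To see what `~` does, it can help to print the mask before and after inverting it. The sketch below also shows the `|` ("or") operator, which is not used elsewhere in this lecture but follows the same pattern as `&`. The cut-down DataFrame and the variable names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['New York', 'London', 'Berlin', 'New York'],
})

mask = df['name'].isin(['Alice', 'Bob'])
print(mask.tolist())      # [True, True, False, False]
print((~mask).tolist())   # [False, False, True, True]

# "or" condition: rows where city is London or Berlin
london_or_berlin = df[(df['city'] == 'London') | (df['city'] == 'Berlin')]
print(london_or_berlin['name'].tolist())  # ['Bob', 'Charlie']
```

As with `&`, each condition needs its own parentheses.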

## Data Manipulation: Sorting

  - Arrange data based on the values of specific columns.
  - Example for sorting by age in descending order:


    ```python
    sorted_data = df.sort_values(by='age', ascending=False)
    print(sorted_data)
    ```

    ```
          name  age      city  salary
    4      Eve   45    London   70000
    0    Alice   40  New York   50000
    3    David   40  New York   65000
    2  Charlie   35    Berlin   60000
    1      Bob   30    London   55000
    5    Frank   30    Berlin   55000
    ```


**Sorting by multiple variables in different directions**:
   Sort by 'city' in ascending order and then by 'age' in descending order:

```python
sorted_df = df.sort_values(by=['city', 'age'], ascending=[True, False])
sorted_df
```

```
      name  age      city  salary
2  Charlie   35    Berlin   60000
5    Frank   30    Berlin   55000
4      Eve   45    London   70000
1      Bob   30    London   55000
0    Alice   40  New York   50000
3    David   40  New York   65000
```
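
Notice in the outputs above that the row labels stay attached to their rows after sorting. The sketch below illustrates this on a small made-up DataFrame and shows one optional way to renumber the rows afterwards with `reset_index(drop=True)`. The name `renumbered` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 45, 35]})

sorted_df = df.sort_values(by='age', ascending=False)
print(df['name'].tolist())         # original order is unchanged
print(sorted_df.index.tolist())    # old row labels travel with the rows: [1, 2, 0]

# Optional: renumber the rows 0, 1, 2, ... after sorting
renumbered = sorted_df.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1, 2]
```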


## Data Manipulation: Creating variables in Pandas

1. **Basic Assignment**:

    ```python
    df['bonus'] = 500  # Adds a new column named 'bonus' with all values set to 500
    print(df)
    ```

    ```
          name  age      city  salary  bonus
    0    Alice   40  New York   50000    500
    1      Bob   30    London   55000    500
    2  Charlie   35    Berlin   60000    500
    3    David   40  New York   65000    500
    4      Eve   45    London   70000    500
    5    Frank   30    Berlin   55000    500
    ```


2. **Using Mathematical Operations**:
    Suppose you have 'salary' and 'bonus' columns and want to compute the total compensation:

    ```python
    df['total_compensation'] = df['salary'] + df['bonus']
    print(df)
    ```

    ```
          name  age      city  salary  bonus  total_compensation
    0    Alice   40  New York   50000    500               50500
    1      Bob   30    London   55000    500               55500
    2  Charlie   35    Berlin   60000    500               60500
    3    David   40  New York   65000    500               65500
    4      Eve   45    London   70000    500               70500
    5    Frank   30    Berlin   55000    500               55500
    ```


3. **Using String Operations**:
    If you have a 'name' column and want to create a new column with the name length:


    ```python
    df['name_length'] = df['name'].str.len()
    df
    ```

    ```
          name  age      city  salary  bonus  total_compensation  name_length
    0    Alice   40  New York   50000    500               50500            5
    1      Bob   30    London   55000    500               55500            3
    2  Charlie   35    Berlin   60000    500               60500            7
    3    David   40  New York   65000    500               65500            5
    4      Eve   45    London   70000    500               70500            3
    5    Frank   30    Berlin   55000    500               55500            5
    ```
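
New columns can also be built from logical statements, using the same comparisons as in the filtering section. The sketch below is illustrative only: the column names `high_earner` and `bonus_10pct`, the 52000 threshold, and the 10% rate are arbitrary assumptions, not part of the lecture data.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [50000, 55000, 60000],
})

# A boolean column from a logical statement
df['high_earner'] = df['salary'] > 52000

# A column computed from another column (10% is an arbitrary example rate)
df['bonus_10pct'] = df['salary'] * 0.10

print(df)
```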


----

## Tutorial 2, Part 3: Sorting, Filtering, and Subsetting


----
----

## Comparing `loc` and `iloc` for DataFrame Filtering in Pandas

### Key Distinctions

- **Range End Inclusion**: In `iloc`, the end value of a specified range is exclusive, aligning it with Python's standard slicing syntax. On the other hand, `loc` includes the end value in the range.
  
- **Indexer Types**: `iloc` works with integer positions (a single integer, a list of integers, or an integer slice). In contrast, `loc` works with labels and with boolean conditions such as `df['age'] > 30`.

- **Output Format**: When using either `loc` or `iloc` with a single bracket (e.g., `df.loc[2]` or `df.iloc[2]`), the result is a Series. To obtain a DataFrame instead, employ double brackets (e.g., `df.loc[[2]]` or `df.iloc[[2]]`).

- **Indexing Order**: Both `loc` and `iloc` follow a row-first, column-second ordering scheme (for example `df.loc[rows, columns]`). This differs from plain square brackets, where a label such as `df['age']` selects a column.

- **Flexibility and Constraints**: Use `loc` when you know the row or column labels or want to filter with a condition; use `iloc` when you only care about position, such as "the first row" or "the last two columns".
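
The distinctions above can be sketched on a small example. The DataFrame below uses string row labels so that labels and positions are easy to tell apart; the data and variable names are assumptions chosen for illustration.

```python
import pandas as pd

# String labels make the difference between labels and positions visible
df = pd.DataFrame(
    {'age': [25, 30, 35, 40]},
    index=['alice', 'bob', 'charlie', 'david'],
)

print(df.loc['bob', 'age'])    # by label: 30
print(df.iloc[1, 0])           # by position (second row, first column): 30

by_label = df.loc['bob':'charlie']   # label slice: end label is included
by_position = df.iloc[1:3]           # position slice: end position is excluded
print(by_label.index.tolist())       # ['bob', 'charlie']
print(by_position.index.tolist())    # ['bob', 'charlie']

# With the default integer index the same numbers give different results
df_default = df.reset_index(drop=True)
print(len(df_default.loc[1:3]))      # 3 rows (labels 1, 2, 3)
print(len(df_default.iloc[1:3]))     # 2 rows (positions 1, 2)
```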