**Import pandas**

In [None]:
import pandas as pd  # pd is the commonly used alias

# Pandas DataFrame

A DataFrame in Pandas is a two-dimensional labeled data structure that resembles a table or spreadsheet. It consists of rows, labeled by an index, and columns, each identified by a unique column name. Each column can contain a different data type (e.g., integers, floats, strings). Pandas DataFrames offer flexible indexing: you can access rows, columns or individual elements based on their labels or integer positions. In contrast to a Series, a DataFrame can change in size.\
\
DataFrames are highly versatile and widely used for data manipulation, analysis, and visualization tasks in Python.

## 1. Creating a DataFrame

### From a dictionary

Below you see a dictionary `data` with the keys Name, Age, City and Salary. The values are lists containing, respectively, the names, ages, cities and salaries of 5 people. Let's make a Pandas DataFrame based on this data.

In [None]:
# Sample data (dict) for DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['London', 'New York', 'Paris', 'Tokyo', 'Sydney'],
    'Salary': [60000, 75000, 80000, 70000, 65000]
}

In [None]:
# Creating a DataFrame
df = pd.DataFrame(data)

# Displaying the DataFrame
df

Inspecting the DataFrame, you see the column names correspond to the keys of the dictionary, and the data of each column to the values in the corresponding list.
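
To check that every column carries its own data type, the small sketch below builds a separate toy DataFrame (the name `sample` and its values are made up for illustration) and inspects its `dtypes`, `columns` and `index` attributes.

```python
import pandas as pd

# Hypothetical toy data, independent of the notebook's `data` dictionary
sample = pd.DataFrame({
    "Name": ["Alice", "Bob"],        # strings
    "Age": [25, 30],                 # integers
    "Salary": [60000.0, 75000.5],    # floats
})

print(sample.dtypes)         # one data type per column
print(list(sample.columns))  # column names come from the dictionary keys
print(list(sample.index))    # default index: integer positions
```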

### From a list of lists

Below we have a list `data` in which each element is itself a list containing the data of a single person.

In [None]:
data = [['John', 25, 'New York'],
        ['Alice', 30, 'Los Angeles'],
        ['Bob', 35, 'Chicago'],
        ['Charlie', 35, 'New York'],
        ['David', 28, "Tokyo"]]

df = pd.DataFrame(data)
df

If we create a DataFrame using only `data` as argument to the function, the column names are the integer positions. If we want to give meaningful column names, we need to specify them when creating the DataFrame. Let's make a list `columns` containing the column names and add it as an argument to the function.

In [None]:
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
df

We can also provide custom indices.

In [None]:
idx = ["person1", "person2", "person3", "person4", "person5"]
df = pd.DataFrame(data, columns=columns, index=idx)
df

## 2. Renaming columns and rows

### Renaming columns

It might happen that we want to change our column names. Luckily it is not necessary to create the DataFrame anew.

In [None]:
df.columns = ["N", "A", "C"]
df

It is also possible to change only some of the columns. To do this, we use `.rename()` and provide a dictionary with the current column name as key and the new column name as value. \
\
As mentioned before, Pandas favors creating a new object over changing the existing one. However, if we want to change our current DataFrame instead of creating a new one, we use the argument `inplace=True`.

In [None]:
df.rename(columns={"N": "Name"}, inplace=True)
df
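
A minimal sketch of the difference between the two approaches, using a separate toy DataFrame (`demo` is a made-up name, not part of this notebook): without `inplace=True`, `.rename()` leaves the original untouched and returns a new DataFrame.

```python
import pandas as pd

# Hypothetical toy frame, independent of the notebook's df
demo = pd.DataFrame({"N": ["Alice", "Bob"], "A": [25, 30]})

# Without inplace=True, rename returns a new DataFrame
renamed = demo.rename(columns={"A": "Age"})

print(list(demo.columns))     # the original keeps its column names
print(list(renamed.columns))  # the returned copy carries the new name

# A common alternative to inplace=True is plain reassignment
demo = demo.rename(columns={"A": "Age"})
print(list(demo.columns))
```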

### Renaming rows

Likewise we can rename the row indices of a Pandas DataFrame, all at once or only a selection of row indices.

In [None]:
# Renaming all indices
df.index = ["p1", "p2", "p3", "p4", "p5"]
df

In [None]:
# Renaming a selection of indices
df.rename(index={"p2": "person2"}, inplace=True)
df

🧰 **Task**
* Find a method to reset the indices of a df to their integer values, while creating a new column, named 'index', with the current index names as data.


In [None]:
# Your code

### 💼 Exercise
Go to the exercise notebook and make exercise **1. Soccer**.

## 3. Indexing

Make sure your dataframe no longer has the 'index' column, the row indices are set to "person1" etc., and the columns have meaningful names.
* Remove the column "index" (<a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html">check out the documentation of df.drop</a>)
* Assign the index labels
* Change the column names "A" and "C", to "Age" and "City" respectively.

In [None]:
df.drop(["index"], axis = 1, inplace=True)
df.index = ["person1", "person2", "person3", "person4", "person5"]
df.rename(columns={"A": "Age", "C": "City"}, inplace=True)


### Columns

In [None]:
# Accessing a column by name
print(df['Name'])

### Rows
When working with index labels, we use `loc`; when working with integer positions, we use `iloc`.
#### Accessing a row by label

In [None]:
# Accessing a row by label
print(df.loc["person1"])

#### Accessing a row by integer position

In [None]:
# Accessing a row by integer position
print(df.iloc[0])

### Accessing an individual element

In [None]:
# Accessing an individual element
print(df.at["person1", 'Name'])
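
The accessors above come in label-based and position-based pairs: `loc` and `iloc` for rows, `at` and `iat` for single elements. The sketch below uses a separate toy DataFrame (`people` is a made-up name) to show that each pair reaches the same data.

```python
import pandas as pd

# Hypothetical toy frame, independent of the notebook's df
people = pd.DataFrame(
    {"Name": ["John", "Alice"], "Age": [25, 30]},
    index=["person1", "person2"],
)

# Same row, once by label and once by integer position
row_by_label = people.loc["person2"]
row_by_position = people.iloc[1]
print(row_by_label.equals(row_by_position))

# Same element, once by labels (at) and once by integer positions (iat)
value_by_label = people.at["person2", "Age"]
value_by_position = people.iat[1, 1]
print(value_by_label, value_by_position)
```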

## 4. Slicing

If we are only interested in the data of person 2 up to and including person 4, and only in the columns Name up to and including Age, we can create a subset of the data by slicing.

### Labels and names: loc
We use the `loc` accessor to access rows and columns using their index labels. With `loc`, both ends of a slice are included.

In [None]:
# Using index labels for rows and columns: loc

df.loc["person2":"person4", 'Name':'Age']

`df` itself is not affected: slicing creates a new dataframe. If you do not assign this new dataframe to a variable, it will not be stored.

In [None]:
df

### Integer positions: iloc
We use the iloc accessor to access rows and columns using their integer positions.

Be careful! Integer positions start at 0 and the end position of the slice is exclusive!

In [None]:
# Using integer positions for rows and columns: iloc
df.iloc[1:4, 0:2]
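
To make the difference concrete, the sketch below rebuilds the example data under a separate, made-up name (`people`) and compares both slices: a `loc` slice includes its end label, an `iloc` slice excludes its end position, so the two expressions select the same block.

```python
import pandas as pd

# Hypothetical copy of the example data, independent of the notebook's df
people = pd.DataFrame(
    {"Name": ["John", "Alice", "Bob", "Charlie", "David"],
     "Age": [25, 30, 35, 35, 28],
     "City": ["New York", "Los Angeles", "Chicago", "New York", "Tokyo"]},
    index=["person1", "person2", "person3", "person4", "person5"],
)

by_label = people.loc["person2":"person4", "Name":"Age"]  # end label included
by_position = people.iloc[1:4, 0:2]                       # end position excluded

print(by_label.equals(by_position))
print(by_label.shape)
```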

### Boolean indexing
We can also pass a list of booleans to show or hide rows or columns. In this example, the booleans indicate whether or not each column is included.

In [None]:
df.loc["person2":"person4", [True, False, True]]

### Filtering rows based on a condition
We demonstrate filtering the DataFrame based on a condition. In this case, we select rows where the 'Age' column has a value greater than or equal to 30 using `df[df['Age'] >= 30]`. This operation returns a DataFrame containing only the rows that meet the specified condition.

In [None]:
df[df["Age"]>= 30]
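
Conditions can also be combined. A minimal sketch on a separate toy DataFrame (`people` is a made-up name): use `&` for "and" and `|` for "or", and wrap every condition in parentheses, because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical copy of the example data, independent of the notebook's df
people = pd.DataFrame(
    {"Name": ["John", "Alice", "Bob", "Charlie", "David"],
     "Age": [25, 30, 35, 35, 28],
     "City": ["New York", "Los Angeles", "Chicago", "New York", "Tokyo"]},
    index=["person1", "person2", "person3", "person4", "person5"],
)

# Both conditions must hold: at least 30 years old AND living in New York
both = people[(people["Age"] >= 30) & (people["City"] == "New York")]

# At least one condition must hold: younger than 28 OR living in Tokyo
either = people[(people["Age"] < 28) | (people["City"] == "Tokyo")]

print(both)
print(either)
```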

### 💼 Exercise

Make exercise **2. More soccer**.

## 5. Operations

Pandas provides a plethora of built-in methods and functions for performing various data manipulation tasks. These include arithmetic operations, statistical aggregations, merging and joining datasets, handling missing data, and much more. Familiarizing yourself with these methods and understanding their behavior under different scenarios is crucial for efficient data analysis workflows.

### 5.1 Column operations

#### Adding a new column

In [None]:
# Adding a new column
df['Experience'] = [3, 5, 7, 4, 6]
df

#### Removing a column

In [None]:
df = df.drop(columns=["Experience"])


#### Sorting by column

In [None]:
# Sorting DataFrame by a column
sorted_df = df.sort_values(by='Age', ascending=False)
print(sorted_df)

### 5.2 Missing Data Handling

In [None]:
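# Add the Experience column again, so the new row can contain a missing value
df['Experience'] = [3, 5, 7, 4, 6]

# Add a row with missing data
new_row = {"Name": "Mo", "Age": 23, "City": "Leuven", "Experience": pd.NA}
df.loc["person6"] = new_row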
df

In [None]:
# Dropping rows with missing values
df.dropna()


Again, `df` itself is not affected unless you assign the returned dataframe back to the variable `df`.

In [None]:
df

In [None]:
df = df.dropna()
df

In [None]:
# Let's bring person 6 back
df.loc["person6"] = new_row

# Filling missing values with a specified value
p = df.fillna(0)
p
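
Before dropping or filling values, it often helps to check where data is missing. The sketch below uses a separate toy DataFrame (`staff` is a made-up name): `isna()` marks the missing entries, and passing a dictionary to `fillna()` restricts the replacement value to one column.

```python
import pandas as pd

# Hypothetical toy frame with one missing value (None becomes NaN)
staff = pd.DataFrame({"Name": ["Alice", "Mo"], "Experience": [3, None]})

# Count the missing values per column
missing_per_column = staff.isna().sum()
print(missing_per_column)

# Fill missing values in the Experience column only; staff itself is unchanged
filled = staff.fillna({"Experience": 0})
print(filled)
```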

### 5.3 Arithmetic operations
We demonstrate various arithmetic operations on DataFrame columns, including addition, subtraction, multiplication, and division.

#### Create new dataframe

In [None]:
# Let's first create a new DataFrame with test results
grades = {"student": ["Mia", "Mats", "Malik", "Mona", "Mimi"],
          "math": [7, 5, 6, 8, 9],
          "Eng": [9, 8, 6, 7, 7],
          "Geo": [10, 7, 8, 6, 9]}
df_grades = pd.DataFrame(grades)
df_grades

In [None]:
df_grades = df_grades.set_index("student")  # use the column student as row labels
df_grades.index.name = None # remove the index column name
df_grades

#### Element-wise operations


In [None]:
# Addition
print(df_grades['math'] + df_grades['Eng'])  # Adds columns math and Eng element-wise


In [None]:
# Subtraction
print(df_grades['math'] - df_grades['Eng'])  # Subtracts column Eng from column math element-wise


In [None]:
# Multiplication
print(df_grades['math'] * df_grades['Eng'])  # Multiplies columns math and Eng element-wise


In [None]:
# Division
print(df_grades['math'] / df_grades['Eng'])  # Divides column math by column Eng element-wise


### 5.4 Aggregation
Aggregation transforms data into a summary statistic by applying a specific function to (a subset of) the data.

Example
`dataset = [1,2,3,4]` (Here we have multiple values) \
`sum(dataset)` --> 10 (Reduced to a single value, a summary statistic)

Examples of such functions are:
sum, min, max, mean, size, describe, first, last, count, std, var, sem...


In [None]:
print(df_grades["math"].sum())
print(df_grades.sum())
print(df_grades.min())
print(df_grades.max())



In [None]:
# Statistical aggregations: calculate the mean and median of every column

print(df_grades.mean())
print(df_grades.median())

The ***describe()*** method generates summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for numeric columns in the DataFrame.

In [None]:
df_grades.describe()

We use **`agg()`** to calculate specified statistics of every column in our DataFrame.

In [None]:
df_grades.agg(['sum', 'min', 'max', 'mean', 'median'])
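
Two variations that may be useful, sketched on a separate toy DataFrame (`scores` is a made-up name): passing a dictionary to `agg()` selects different statistics per column, and `axis=1` aggregates across the columns of each row instead of down each column.

```python
import pandas as pd

# Hypothetical toy grades, independent of the notebook's df_grades
scores = pd.DataFrame(
    {"math": [7, 5, 6], "Eng": [9, 8, 6]},
    index=["Mia", "Mats", "Malik"],
)

# Different statistics per column; combinations that were not requested are NaN
per_column = scores.agg({"math": ["min", "max"], "Eng": ["mean"]})
print(per_column)

# Aggregate across columns: one total per student
per_student = scores.sum(axis=1)
print(per_student)
```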

### 5.5 Grouping

A group by operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

In [None]:
# Count how many of the people live in each city

by_city = df.groupby("City").count()
by_city

In [None]:
df.groupby("Age").count()

In [None]:
# Group by City and Age
df.groupby(["City", "Age"]).count()

In [None]:
# Group by City and Age and only display one column with the count
df.groupby(["City", "Age"])["Name"].count()

In [None]:
# Grouping by City and calculating the mean Age
avg_age_by_city = df.groupby('City')['Age'].mean()
avg_age_by_city
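
The split, apply and combine steps can be made visible by doing them by hand. This is only an illustrative sketch on a separate toy DataFrame (`people` is a made-up name); in practice the one-line `groupby` call is the way to go.

```python
import pandas as pd

# Hypothetical toy frame, independent of the notebook's df
people = pd.DataFrame({
    "City": ["New York", "Chicago", "New York"],
    "Age": [25, 35, 35],
})

# Split: iterating over a groupby yields (group label, sub-DataFrame) pairs
# Apply: take the mean of Age within each sub-DataFrame
# Combine: collect the results in a dictionary
manual = {city: group["Age"].mean() for city, group in people.groupby("City")}

# The same three steps in a single call
automatic = people.groupby("City")["Age"].mean()

print(manual)
print(automatic)
```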

#### 💼 Exercise
Make exercise 3. **Employees**.

### 5.6 Attributes and underlying data
A DataFrame object has several attributes, such as index, columns, values, dtypes, axes, ndim, size, shape and empty, as well as methods such as head() and tail() to inspect the underlying data.

In [None]:
# Let's play around
df.shape

Use the ***info()*** method to display information about the DataFrame.

In [None]:
df.info()

Use the **head(x)** method to display the first x rows of the data. If no argument is provided, the first 5 rows are shown.

In [None]:
df.head()

#### Difference between .size() and .count()

On a grouped DataFrame, `.count()` only counts valid (non-null) values, whereas `.size()` counts null values as well. \
This explains the different outcomes of `df.groupby(["City"]).size()` and `df.groupby(["City"]).count()`: `.size()` returns the total number of entries for each city, while `.count()` returns the number of non-null values for every data column.

In [None]:
df.groupby(["City"]).size()

In [None]:
df.groupby(["City"]).count()
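
A compact sketch of the same difference on a separate toy DataFrame (`people` is a made-up name) with one missing value: for Leuven, `.size()` reports two rows, while `.count()` reports only one non-null Experience value.

```python
import pandas as pd

# Hypothetical toy frame with one missing Experience value (None becomes NaN)
people = pd.DataFrame({
    "City": ["Leuven", "Leuven", "Paris"],
    "Experience": [3, None, 7],
})

sizes = people.groupby("City").size()    # rows per city, nulls included
counts = people.groupby("City").count()  # non-null values per column

print(sizes)
print(counts)
```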

### 5.7 Merging and joining DataFrames

In [None]:
# Concatenating along columns
df_concat = pd.concat([df, df], axis=1)  # Concatenates the DataFrame with itself along columns
df_concat

In [None]:
# Concatenating along rows
df_concat = pd.concat([df, df], axis=0)  # Concatenates the DataFrame with itself along rows
df_concat
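
`pd.concat` glues DataFrames together along an axis. To combine rows that share a key, `pd.merge` can be used instead. The sketch below uses two made-up DataFrames (`people` and `salaries`) that are not part of this notebook and joins them on the shared column Name.

```python
import pandas as pd

# Hypothetical toy frames sharing the key column "Name"
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "City": ["London", "Paris", "Tokyo"],
})
salaries = pd.DataFrame({
    "Name": ["Alice", "Bob", "Emma"],
    "Salary": [60000, 75000, 65000],
})

# Inner join (the default): keep only names present in both frames
inner = pd.merge(people, salaries, on="Name")

# Left join: keep every row of people; a missing salary becomes NaN
left = pd.merge(people, salaries, on="Name", how="left")

print(inner)
print(left)
```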

#### 💼 Exercise

Make exercise **4. Sales data** part one.

## 6. csv data
### Reading data from a csv file

Imagine having a dictionary or a list of lists with 1000 entries... Not really practical, right? Luckily we can read in data from files to create dataframes. Take a look at the code cell below. Explain the function of every argument provided.

In [None]:
df = pd.read_csv('soccer_data.csv', sep=",", index_col=0)


In [None]:
#Your explanation

❓How do you create a df if the csv file does not provide column names?

🧰 **Task** Inspect the data by showing the first 8 data entries.

In [None]:
#Inspect data


🧰 **Task** Remove the index column name.

In [None]:
# Your code

### Writing to a csv file

We can also write the data of our dataframe to a file. \
\
🧰  **Task** \
Write the data of your grades dataframe to a csv file named "myoutput".
Search <a href="https://pandas.pydata.org/docs/reference/index.html#api">the Pandas documentation</a> for the correct syntax.

In [None]:
#write to csv


#### 💼 Exercise

Make the second part of exercise **4. Sales data**.