# **Pandas**

## **1. Creating, Reading, and Writing Data**

### Creating a DataFrame

A DataFrame is the main data structure in Pandas: a two-dimensional table with labeled rows and columns. It can be created from various sources, including dictionaries, lists, NumPy arrays, and other data structures.

#### Example 1: Creating a DataFrame from a Dictionary

```python
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alexey', 'Maria', 'Ivan'],
    'Age': [25, 30, 22],
    'City': ['Moscow', 'St. Petersburg', 'Novosibirsk']
}

df = pd.DataFrame(data)
print(df)
```

#### Example 2: Creating a DataFrame from a List of Lists

```python
# Creating a DataFrame from a list of lists
data = [
    ['Alexey', 25, 'Moscow'],
    ['Maria', 30, 'St. Petersburg'],
    ['Ivan', 22, 'Novosibirsk']
]

df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)
```

#### Example 3: Creating a DataFrame with Specified Indices

```python
# Creating a DataFrame with specified indices
data = [
    ['Alexey', 25, 'Moscow'],
    ['Maria', 30, 'St. Petersburg'],
    ['Ivan', 22, 'Novosibirsk']
]

df = pd.DataFrame(data, columns=['Name', 'Age', 'City'], index=['a', 'b', 'c'])
print(df)
```
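Besides dictionaries and lists, a DataFrame can also be built from a NumPy array, as mentioned at the start of this section. The sketch below is a minimal illustration; the array contents and the column names `a`, `b`, `c` are made up for the example.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array: each row of the array becomes a row of the DataFrame
values = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Column names are illustrative placeholders
df_from_array = pd.DataFrame(values, columns=['a', 'b', 'c'])
print(df_from_array)
print(df_from_array.shape)  # (2, 3)
```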

### Reading Data from a File

Pandas supports reading data from various formats such as CSV, Excel, JSON, and others.

#### Example 4: Reading Data from a CSV File

```python
# Reading a CSV file
df = pd.read_csv('data.csv')
print(df)
```

#### Example 5: Reading Data from an Excel File

```python
# Reading an Excel file
df = pd.read_excel('data.xlsx')
print(df)
```

#### Example 6: Reading Data from a JSON File

```python
# Reading a JSON file
df = pd.read_json('data.json')
print(df)
```
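The reading examples above assume that files such as `data.csv` already exist. To experiment without any files on disk, one option is to pass an in-memory text buffer to `pd.read_csv()`. The sketch below does that with made-up CSV text and also shows the `usecols` and `nrows` parameters, which limit the columns and rows that are read.

```python
import io
import pandas as pd

# Made-up CSV content held in a string instead of a file
csv_text = """Name,Age,City
Alexey,25,Moscow
Maria,30,St. Petersburg
Ivan,22,Novosibirsk
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# Read only two columns and only the first two data rows
df_part = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Age'], nrows=2)
print(df_part)
```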

### Writing Data to a File

Pandas also supports writing data to various formats.

#### Example 7: Writing a DataFrame to a CSV File

```python
# Writing a DataFrame to a CSV file
df.to_csv('output.csv', index=False)

# Note: The `index=False` parameter indicates that the index should not be written to the file.
```

#### Example 8: Writing a DataFrame to an Excel File

```python
# Writing a DataFrame to an Excel file
df.to_excel('output.xlsx', index=False)
```

#### Example 9: Writing a DataFrame to a JSON File

```python
# Writing a DataFrame to a JSON file
df.to_json('output.json')
```

```python
df.to_json('output.json', orient='records', indent=4, force_ascii=False)
```

The second call above uses the following parameters of the `to_json()` method:

`orient`:
- 'records' – saves data as a list of dictionaries (each dictionary is a row in the DataFrame).
- 'split' – saves data as separate keys: index, columns, data.
- 'index' – uses indices as top-level keys.
- 'columns' – uses column names as top-level keys.
- 'values' – saves only values (without indices and column names).
- 'table' – saves data in a format compatible with Table Schema.

`indent`:
- Specifies the number of spaces for indentation. If indent is not specified, JSON will be saved in one line.

`force_ascii`:
- If False, non-ASCII characters (e.g., Cyrillic) are saved as is. If True, they are converted to escape sequences.
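To see how `orient` changes the layout, it can help to compare the generated JSON text directly. When no path is given, `to_json()` returns the JSON as a string. The following is a small sketch on a made-up two-row DataFrame.

```python
import json
import pandas as pd

df_small = pd.DataFrame({'Name': ['Alexey', 'Maria'], 'Age': [25, 30]})

# Without a file path, to_json() returns the JSON text instead of writing it
records_json = df_small.to_json(orient='records')
columns_json = df_small.to_json(orient='columns')
split_json = df_small.to_json(orient='split')

print(records_json)  # a list with one dictionary per row
print(columns_json)  # column names as top-level keys
print(split_json)    # separate "columns", "index" and "data" keys

# The strings are ordinary JSON and can be parsed back with the json module
rows = json.loads(records_json)
print(rows[0]['Name'])
```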

## **2. Indexing, Selection, and Assignment**

In Pandas, there are several ways to select data from a DataFrame. In this section, we will look at the main methods of indexing, selection, and data assignment.

### Selecting Columns

#### Example 1: Selecting a Single Column
To select a single column, use the syntax `df['column_name']`.

```python
import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alexey', 'Maria', 'Ivan'],
    'Age': [25, 30, 22],
    'City': ['Moscow', 'St. Petersburg', 'Novosibirsk']
}
df = pd.DataFrame(data)

# Selecting the "Name" column
names = df['Name']
print(names)
```

#### Example 2: Selecting Multiple Columns
To select multiple columns, use the syntax `df[['column1', 'column2']]`.

```python
# Selecting the "Name" and "City" columns
subset = df[['Name', 'City']]
print(subset)
```

### Selecting Rows

#### Example 3: Selecting Rows by Index Using `.iloc`
The `.iloc` indexer selects rows by their integer position (starting from 0), regardless of the index labels.

```python
# Selecting the first row
first_row = df.iloc[0]
print(first_row)
```

#### Example 4: Selecting Multiple Rows by Index

```python
# Selecting the first two rows
first_two_rows = df.iloc[:2]
print(first_two_rows)
```

#### Example 5: Selecting Rows by Condition
To select rows by condition, use boolean indexing.

```python
# Selecting rows where age is greater than 25
adults = df[df['Age'] > 25]
print(adults)
```
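Several conditions can be combined in one boolean mask. The sketch below rebuilds the same sample data and uses `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons in Python.

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Alexey', 'Maria', 'Ivan'],
    'Age': [25, 30, 22],
    'City': ['Moscow', 'St. Petersburg', 'Novosibirsk'],
})

# AND: younger than 30 and living in Moscow
young_muscovites = df_people[(df_people['Age'] < 30) & (df_people['City'] == 'Moscow')]

# OR: older than 25 or living in Novosibirsk
either = df_people[(df_people['Age'] > 25) | (df_people['City'] == 'Novosibirsk')]

# NOT: everyone who does not live in Moscow
not_moscow = df_people[~(df_people['City'] == 'Moscow')]

print(young_muscovites)
print(either)
print(not_moscow)
```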

### Selecting Specific Cells

#### Example 6: Selecting a Value by Row and Column Index Using `.iloc`

```python
# Selecting a value from the first row and second column
value = df.iloc[0, 1]
print(value)
```

#### Example 7: Selecting a Value by Row and Column Label Using `.loc`
The `.loc` indexer selects data by row and column labels. With the default index, the row labels are the integers 0, 1, 2, and so on.

```python
# Selecting a value from the row with index 0 and the "Age" column
value = df.loc[0, 'Age']
print(value)
```
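With the default integer index, `.iloc` and `.loc` can look interchangeable. The difference becomes visible once the index holds other labels. The sketch below uses a made-up string index to illustrate it, including the fact that a `.iloc` slice excludes its end position while a `.loc` slice includes its end label.

```python
import pandas as pd

df_labeled = pd.DataFrame(
    {'Name': ['Alexey', 'Maria', 'Ivan'], 'Age': [25, 30, 22]},
    index=['a', 'b', 'c'],
)

by_position = df_labeled.iloc[0]   # first row, by integer position
by_label = df_labeled.loc['a']     # row whose label is 'a' (the same row here)

pos_slice = df_labeled.iloc[0:2]       # positions 0 and 1; the end is excluded
label_slice = df_labeled.loc['a':'c']  # labels 'a' through 'c'; the end is included

print(pos_slice)
print(label_slice)
```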

### Assigning Values

#### Example 8: Assigning a Value to an Entire Column

```python
# Changing values in the "Age" column
df['Age'] = [26, 31, 23]
print(df)
```

#### Example 9: Assigning a Value to a Specific Cell

```python
# Changing a value in the first row and "Age" column
df.at[0, 'Age'] = 27
print(df)
```

#### Example 10: Assigning Values Using a Condition

```python
# Increasing age by 1 for everyone older than 25
df.loc[df['Age'] > 25, 'Age'] += 1
print(df)
```

#### Example 11: Selecting Rows Using `.isin`
The `.isin` method allows selecting rows where the column value is in a given list.

```python
# Selecting rows where the city is Moscow or Novosibirsk
result = df[df['City'].isin(['Moscow', 'Novosibirsk'])]
print(result)
```

## **3. Summary Functions and Maps**

In Pandas, there are many functions for analyzing and transforming data. In this section, we will look at the main summary functions and methods for applying functions to data (maps).

### Summary Functions

Summary functions allow you to quickly get statistical information about the data.

#### Example 1: Basic Summary Functions

```python
import pandas as pd

df = pd.read_csv('data.csv')

# Mean value
mean_age = df['Age'].mean()
print(f"Average age: {mean_age}")

# Median
median_salary = df['Salary'].median()
print(f"Median salary: {median_salary}")

# Minimum value
min_age = df['Age'].min()
print(f"Minimum age: {min_age}")

# Maximum value
max_salary = df['Salary'].max()
print(f"Maximum salary: {max_salary}")

# Standard deviation
std_age = df['Age'].std()
print(f"Standard deviation of age: {std_age}")
```

#### Example 2: Descriptive Statistics Using `.describe()`

The `.describe()` method provides basic statistics for numerical columns: count, mean, standard deviation, minimum, quartiles, and maximum.

```python
# Descriptive statistics
stats = df.describe()
print(stats)
```

```python
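# Getting information about the data (column names, non-null counts, dtypes)
# info() prints its report itself and returns None, so print() is not needed
df.info()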
```

#### Example 3: Unique Values and Their Count

```python
# Unique values in the "Name" column
unique_names = df['Name'].unique()
print(f"Unique names: {unique_names}")

# Number of unique values
num_unique_names = df['Name'].nunique()
print(f"Number of unique names: {num_unique_names}")

# Frequency of each unique value
name_counts = df['Name'].value_counts()
print(name_counts)
```

### Applying Functions to Data (Maps)

#### Example 4: Applying a Function to a Column Using `.apply()`

The `.apply()` method allows applying a function to each element of a column.

```python
# Increasing age by 1 year
df['Age'] = df['Age'].apply(lambda x: x + 1)
print(df)
```
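For simple arithmetic, the same result can usually be obtained without `.apply()` by operating on the whole column at once. For value-by-value lookups, `Series.map()` accepts a dictionary. The sketch below shows both on made-up data; the city codes are hypothetical and exist only for this illustration.

```python
import pandas as pd

df_map = pd.DataFrame({
    'Name': ['Alexey', 'Maria', 'Ivan'],
    'Age': [25, 30, 22],
    'City': ['Moscow', 'St. Petersburg', 'Novosibirsk'],
})

# Vectorized arithmetic: equivalent to apply(lambda x: x + 1)
df_map['Age_next_year'] = df_map['Age'] + 1

# map() with a dictionary; values missing from the dictionary become NaN
city_codes = {'Moscow': 'MSK', 'St. Petersburg': 'SPB'}  # hypothetical codes
df_map['City_code'] = df_map['City'].map(city_codes)

print(df_map)
```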

### Working with Categorical Data

#### Example 7: Converting Data to Categories

```python
# Converting the "Name" column to a categorical type
df['Name'] = df['Name'].astype('category')
print(df['Name'].cat.categories)
```

#### Example 8: Applying a Function to Categorical Data

```python
# Applying a function to categorical data
df['Name'] = df['Name'].apply(lambda x: x.lower())
print(df)
```

### Working with Missing Values

#### Example 9: Filling Missing Values

```python
# Adding missing values
df.loc[1, 'Salary'] = None

# Filling missing values with the mean value
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print(df)
```

## **4. Grouping and Sorting**

Grouping and sorting data are important operations for data analysis and processing. In Pandas, the `.groupby()` and `.sort_values()` methods are used for this.

### Grouping Data

Grouping divides the data into groups based on the values of one or more columns and then aggregates each group (for example, by counting, summing, or averaging).

#### Example 1: Grouping by a Single Column

```python
import pandas as pd

df = pd.read_csv('data.csv')

# Grouping by the "City" column and finding the maximum salary in each city
grouped = df.groupby('City')['Salary'].max()
print(grouped)
```

#### Example 3: Aggregating Data After Grouping

The `.agg()` method allows applying multiple aggregation functions simultaneously.

```python
# Grouping by the "City" column and applying multiple aggregation functions
grouped = df.groupby('City')['Salary'].agg(['sum', 'mean', 'count'])
print(grouped)
```
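When several aggregations are computed at once, it can be convenient to name the result columns explicitly. Pandas supports this through named aggregation, where each keyword argument of `.agg()` is a `(column, function)` pair. The sketch below uses a small made-up dataset rather than `data.csv`, and the output column names are arbitrary.

```python
import pandas as pd

df_sal = pd.DataFrame({
    'City': ['Moscow', 'Moscow', 'Novosibirsk'],
    'Salary': [50000, 60000, 45000],
})

# Each keyword becomes an output column: name=(source column, function)
summary = df_sal.groupby('City').agg(
    total_salary=('Salary', 'sum'),
    mean_salary=('Salary', 'mean'),
    employees=('Salary', 'count'),
)
print(summary)
```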

### Sorting Data

Sorting data allows ordering the rows of a DataFrame based on the values of one or more columns.

#### Example 5: Sorting by a Single Column

```python
# Sorting by the "Salary" column in ascending order
sorted_df = df.sort_values(by='Salary')
print(sorted_df)
```

#### Example 6: Sorting by Multiple Columns

```python
# Sorting by the "City" column, then by "Salary" within each city
sorted_df = df.sort_values(by=['City', 'Salary'])
print(sorted_df)
```

#### Example 7: Sorting in Descending Order

```python
# Sorting by the "Salary" column in descending order
sorted_df = df.sort_values(by='Salary', ascending=False)
print(sorted_df)
```

### Combining Grouping and Sorting

#### Example 8: Grouping and Sorting Results

```python
# Grouping by the "City" column and calculating the total salary per city
grouped = df.groupby('City')['Salary'].sum()

# Sorting the grouping results
sorted_grouped = grouped.sort_values(ascending=False)
print(sorted_grouped)
```

#### Example 9: Grouping, Aggregation, and Sorting

```python
# Grouping by the "City" column, aggregation, and sorting
result = df.groupby('City').agg({
    'Salary': ['sum', 'mean'],
    'Name': 'count'
}).sort_values(by=('Salary', 'sum'), ascending=False)
print(result)
```

### Counting the Number of People in Each City

```python
df_counts = df['City'].value_counts()
print(df_counts)
```

## **5. Data Types and Missing Values**

Working with data types and missing values is an important part of data processing. Pandas provides powerful tools for managing data types and handling missing values.

### Data Types

Each column in a DataFrame has a specific data type. Pandas automatically infers data types when creating a DataFrame, but they can be manually changed.

#### Example 1: Viewing Data Types

```python
import pandas as pd

# Create a DataFrame
data = {
    'Name': ['Alexey', 'Maria', 'Ivan'],
    'Age': [25, 30, 22],
    'Salary': [50000.0, 60000.0, 45000.0],
    'Hire_Date': ['2021-01-01', '2020-05-15', '2019-11-20']
}
df = pd.DataFrame(data)

# View data types
print(df.dtypes)
```

#### Example 2: Changing Data Types

```python
# Change the data type of the "Age" column to float
df['Age'] = df['Age'].astype(float)
print(df.dtypes)
```

#### Example 3: Converting Strings to Dates

```python
# Convert the "Hire_Date" column to datetime type
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])
print(df.dtypes)
```
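Once a column has a datetime type, its components are available through the `.dt` accessor. Real data may also contain strings that are not valid dates; passing `errors='coerce'` to `pd.to_datetime()` converts those to `NaT` (the missing value for dates) instead of raising an error. The sketch below uses a made-up Series with one deliberately invalid entry.

```python
import pandas as pd

dates = pd.Series(['2021-01-01', '2020-05-15', 'not a date'])

# errors='coerce' replaces unparseable strings with NaT instead of raising
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed)

# The .dt accessor exposes date components; NaT entries yield NaN
years = parsed.dt.year
months = parsed.dt.month
print(years)
print(months)
```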

### Missing Values

Missing values can occur for various reasons, and Pandas provides tools to handle them.

#### Example 5: Detecting Missing Values

```python
# Add missing values
df.loc[1, 'Salary'] = None
df.loc[2, 'Age'] = None

# Check for missing values (isnull() is an alias of isna() and gives the same result)
print(df.isna())
```

#### Example 6: Counting Missing Values

```python
# Count missing values in each column
print(df.isnull().sum())
```

#### Example 7: Filling Missing Values

```python
# Fill missing values in the "Salary" column with the mean value
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Fill missing values in the "Age" column with the median value
df['Age'] = df['Age'].fillna(df['Age'].median())

print(df)
```

#### Example 8: Removing Rows with Missing Values

```python
# Remove rows with any missing values
df_cleaned = df.dropna()
print(df_cleaned)
```
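Dropping every row that has any gap can discard more data than intended. `dropna()` has parameters that narrow the rule: `subset` restricts the check to certain columns, `thresh` keeps rows with at least a given number of non-missing values, and `axis=1` drops columns instead of rows. The sketch below demonstrates them on a made-up DataFrame.

```python
import numpy as np
import pandas as pd

df_gaps = pd.DataFrame({
    'Name': ['Alexey', 'Maria', 'Ivan'],
    'Age': [25.0, np.nan, 22.0],
    'Salary': [50000.0, np.nan, np.nan],
})

# Drop rows only if "Salary" is missing
only_salary = df_gaps.dropna(subset=['Salary'])

# Keep rows that have at least two non-missing values
at_least_two = df_gaps.dropna(thresh=2)

# Drop columns (not rows) that contain any missing value
no_gap_columns = df_gaps.dropna(axis=1)

print(only_salary)
print(at_least_two)
print(no_gap_columns)
```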

#### Example 9: Filling Missing Values Using Interpolation

```python
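# Interpolate missing values linearly from neighboring rows
# (this changes nothing if the gaps were already filled in Example 7)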
df['Age'] = df['Age'].interpolate()
print(df)
```

### Working with Categorical Data

#### Example 10: Converting Data to Categories

```python
# Convert the "Name" column to categorical type
df['Name'] = df['Name'].astype('category')
print(df['Name'].cat.categories)
```

#### Example 11: Adding a New Category

```python
# Add a new category
df['Name'] = df['Name'].cat.add_categories(['Olga'])
print(df['Name'].cat.categories)
```

### Additional Examples

#### Example 12: Replacing Values

```python
# Replace the value "Maria" with "Marina" in the "Name" column
df['Name'] = df['Name'].replace('Maria', 'Marina')
print(df)
```

## **6. Renaming and Combining**

Pandas provides convenient tools for renaming columns and indices, as well as for combining data from different sources. In this section, we will look at how to rename columns and indices, and how to combine DataFrames using various methods.

### Renaming

#### Example 1: Renaming Columns

```python
import pandas as pd

# Create a DataFrame
data = {
    'Name': ['Alexey', 'Maria', 'Ivan'],
    'Age': [25, 30, 22],
    'City': ['Moscow', 'St. Petersburg', 'Novosibirsk']
}
df = pd.DataFrame(data)

# Rename columns
df = df.rename(columns={
    'Name': 'Full Name',
    'Age': 'Age, years',
    'City': 'Residence'
})
print(df)
```

#### Example 2: Renaming Indices

```python
# Rename indices
df = df.rename(index={0: 'a', 1: 'b', 2: 'c'})
print(df)
```

#### Example 3: Renaming Using `.columns`

```python
# Rename columns directly using the .columns attribute
df.columns = ['Full Name', 'Age', 'City']
print(df)
```

### Combining Data

#### Example 4: Combining Rows with `pd.concat()`

```python
# Create two DataFrames
df1 = pd.DataFrame({
    'Name': ['Alexey', 'Maria'],
    'Age': [25, 30]
})

df2 = pd.DataFrame({
    'Name': ['Ivan', 'Olga'],
    'Age': [22, 28]
})

# Combine rows
result = pd.concat([df1, df2], ignore_index=True)
print(result)
```

#### Example 5: Combining Columns

```python
# Combine columns
result = pd.concat([df1, df2], axis=1)
print(result)
```

#### Example 6: Combining with `pd.merge()`

```python
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alexey', 'Maria', 'Ivan']
})

df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'Salary': [50000, 60000, 45000]
})

# Inner join
result = pd.merge(df1, df2, on='ID', how='inner')
print(result)
```

#### Example 7: Combining with Different Types of JOIN

```python
# Left join
result = pd.merge(df1, df2, on='ID', how='left')
print(result)
print()

# Outer join
result = pd.merge(df1, df2, on='ID', how='outer')
print(result)
```
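To check which rows of a join matched, `pd.merge()` accepts `indicator=True`, which adds a `_merge` column with the values `both`, `left_only`, or `right_only`. The sketch below repeats the two DataFrames from Example 6 under the illustrative names `left` and `right`.

```python
import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alexey', 'Maria', 'Ivan']})
right = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 45000]})

# Outer join keeps every ID; _merge records where each row came from
merged = pd.merge(left, right, on='ID', how='outer', indicator=True)
print(merged)

# Rows that exist in only one of the two tables
unmatched = merged[merged['_merge'] != 'both']
print(unmatched)
```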

#### Example 8: Combining on Multiple Keys

```python
# Create two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'City': ['Moscow', 'St. Petersburg', 'Novosibirsk'],
    'Name': ['Alexey', 'Maria', 'Ivan']
})

df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'City': ['Moscow', 'St. Petersburg', 'Kazan'],
    'Salary': [50000, 60000, 45000]
})

# Combine on multiple keys
result = pd.merge(df1, df2, on=['ID', 'City'], how='inner')
print(result)
```

## **7. Modifying Data**

```python
import pandas as pd

df = pd.read_csv('data.csv')
print(df.head())  # Preview the first five rows
```

### Adding New Columns

```python
df['Net Salary'] = df['Salary'] * 0.87  # Income after tax (0.87 = 1 - 0.13, i.e. a 13% tax rate)
print(df.head())
```

### Removing Columns

```python
df.drop(columns=['Hire Date'], inplace=True)  # Remove the 'Hire Date' column
print(df.head())
```

### Removing Duplicates

Often, data contains duplicate rows that need to be removed.

#### Checking for Duplicates

```python
print(df.duplicated().sum())  # Number of duplicates
```

#### Removing Duplicates

```python
df = df.drop_duplicates()
# You can also remove duplicates based on specific columns:
df = df.drop_duplicates(subset=['Name'])
```
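By default, `drop_duplicates()` keeps the first occurrence of each duplicate; the `keep` parameter changes that. The sketch below uses a made-up DataFrame to show the difference between a full-row duplicate check and a check on a single column.

```python
import pandas as pd

df_dup = pd.DataFrame({
    'Name': ['Alexey', 'Maria', 'Alexey', 'Alexey'],
    'Salary': [50000, 60000, 50000, 55000],
})

# Full-row duplicates: only the third row repeats an earlier row exactly
n_full = df_dup.duplicated().sum()
print(n_full)

# Duplicates by "Name": keep the first or the last occurrence of each name
first = df_dup.drop_duplicates(subset=['Name'])               # keep='first' is the default
last = df_dup.drop_duplicates(subset=['Name'], keep='last')

print(first)
print(last)
```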