# Data Manipulation with Pandas

Data manipulation is one of the core skills in data science and machine learning, and Pandas is the go-to library in Python for these tasks. Pandas provides numerous functions and methods to efficiently process, clean, and analyze data. Here's an overview of key data manipulation tasks you can perform with Pandas:

## Setup
First, you need to install and import Pandas. If you haven't installed it, you can do so using pip:

In [None]:
# !pip install pandas

Now, import `pandas` (conventionally aliased as `pd`):

In [None]:
import pandas as pd

## 1. Data Loading
Pandas supports reading from and writing to a variety of data formats, including `CSV`, `Excel`, `JSON`, and `SQL` databases.

- `read_csv`, `read_excel`, `read_json`, `read_sql` for loading data.
- `to_csv`, `to_excel`, `to_json`, `to_sql` for saving data.

In [None]:
df = pd.read_csv('data.csv')
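As a minimal sketch of the loading functions listed above, the example below reads from an in-memory string instead of a file on disk. The column names and values are made up for illustration; `parse_dates` and `usecols` are two commonly used `read_csv` options.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file such as data.csv.
csv_text = "Name,Age,Joined\nAda,36,2020-01-15\nBo,17,2023-06-01\n"

# read_csv accepts a path or any file-like object.
# parse_dates converts the listed columns to datetime values while loading.
people = pd.read_csv(io.StringIO(csv_text), parse_dates=["Joined"])

# usecols restricts loading to the columns you actually need.
names_only = pd.read_csv(io.StringIO(csv_text), usecols=["Name"])

print(people.dtypes)
```

Loading only the needed columns can noticeably reduce memory use on wide files.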

## 2. Data Inspection
Once you've loaded your data into a DataFrame (Pandas' primary data structure), you can start inspecting the data.

- `head()` and `tail()` to view the first and last rows of the DataFrame.
- `info()` to get a concise summary of the DataFrame, including the number of non-null values in each column and memory usage.
- `describe()` to view the statistical summary of numerical columns.

In [None]:
# Inspect the first few rows of the DataFrame.
print(df.head())

# View a concise summary of the DataFrame.
# info() prints its report directly and returns None, so no print() is needed.
df.info()
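The remaining inspection methods from the list above can be sketched as follows. The small DataFrame is hypothetical and only there to make the example self-contained.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "Name": ["Ada", "Bo", "Cy", "Di"],
    "Age": [36, 17, 52, 29],
    "Salary": [120.0, 40.0, 95.0, None],
})

# tail(n) returns the last n rows (5 by default).
last_two = df.tail(2)

# describe() reports count, mean, std, min, quartiles, and max
# for the numeric columns; missing values are excluded from the count.
summary = df.describe()

print(summary)
```

Comparing the `count` row of `describe()` with the number of rows is a quick way to spot columns with missing values.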

## 3. Data Cleaning
Real-world data is often messy. Pandas offers powerful tools for cleaning data.

- Handling missing data using methods like `isnull()`, `notnull()`, `dropna()`, and `fillna()`.
- Removing duplicates with `drop_duplicates()`.
- Renaming columns for better readability using `rename()`.

In [None]:
# Remove rows with missing values...
df = df.dropna()

# ...or, as an alternative, fill them with a default value instead.
# (Run after dropna(), this would have nothing left to fill.)
# df = df.fillna(value=0)

# Remove duplicate rows.
df = df.drop_duplicates()
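A slightly more targeted cleaning pass might look like the sketch below. The data and column names are hypothetical; the point is that `dropna()` and `fillna()` can be limited to specific columns, and that the steps chain together.

```python
import pandas as pd

# Hypothetical messy data: missing ages, a missing salary, and a duplicate row.
raw = pd.DataFrame({
    "emp_name": ["Ada", "Bo", "Bo", "Cy"],
    "Age": [36.0, None, None, 52.0],
    "Salary": [120.0, 40.0, 40.0, None],
})

# Count missing values per column before deciding how to handle them.
missing = raw.isnull().sum()

cleaned = (
    raw.rename(columns={"emp_name": "Name"})  # clearer column name
       .drop_duplicates()                      # drop the repeated "Bo" row
       .dropna(subset=["Age"])                 # drop rows only where Age is missing
       .fillna({"Salary": 0})                  # fill Salary alone with a default
)

print(cleaned)
```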

## 4. Data Filtering and Selection
Pandas allows for both simple and complex indexing and selection operations.

- Selecting specific columns or rows using column names, indices, or conditions.
- Advanced indexing with `.loc[]` (label-based) and `.iloc[]` (integer-based).
- Conditional selection using boolean arrays.

In [None]:
# Select a specific column. 
ages = df['Age']

# Select rows based on a condition.
youngsters = df[df['Age'] < 18]
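The difference between `.loc[]` and `.iloc[]`, and the way conditions combine, can be sketched as follows. The DataFrame and its string index labels are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {"Age": [36, 17, 52, 15], "Department": ["Eng", "Sales", "Eng", "Sales"]},
    index=["ada", "bo", "cy", "di"],
)

# .loc is label-based: row label first, then column label.
ada_age = df.loc["ada", "Age"]

# .iloc is position-based: the first two rows of the first column.
first_two_ages = df.iloc[:2, 0]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
young_sales = df[(df["Age"] < 18) & (df["Department"] == "Sales")]

# .loc also accepts a boolean mask together with a column selection.
eng_ages = df.loc[df["Department"] == "Eng", "Age"]
```

Python's `and`/`or` keywords do not work on boolean Series, which is why the element-wise operators `&` and `|` are used here.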

## 5. Data Transformation
Transforming your data is essential in the data preparation phase before analysis or modeling.

- Creating new columns based on existing data.
- Applying functions to columns or rows using `apply()`.
- Aggregating data using `groupby()` and summarizing using aggregation functions like `sum()`, `mean()`, `max()`, `min()`, and custom functions.

In [None]:
# Create a new column based on an existing one.
df['AgeInMonths'] = df['Age'] * 12

# Apply a function to a column.
df['AgeSquared'] = df['Age'].apply(lambda x: x**2)

# Aggregate data after grouping.
grouped = df.groupby('Department')
print(grouped['Salary'].mean())
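Several aggregations, including a custom one, can be computed in a single pass with `agg()`. The sketch below uses made-up data, and `value_range` is a hypothetical helper defined only for this example.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "Department": ["Eng", "Eng", "Sales", "Sales"],
    "Salary": [120, 100, 40, 60],
})


def value_range(values):
    """Custom aggregation: spread between the highest and lowest value."""
    return values.max() - values.min()


# Named aggregation: each keyword becomes a column in the result.
stats = df.groupby("Department")["Salary"].agg(
    total="sum",
    average="mean",
    spread=value_range,
)

print(stats)
```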

## 6. Merging and Joining
Combining data from multiple sources is a common task in data analysis.

- Concatenating dataframes vertically or horizontally using `pd.concat()`.
- Merging dataframes based on a common key (similar to SQL joins) using `merge()`.

In [None]:
# Merge two DataFrames on a shared key column.
# df1 and df2 are placeholders for two DataFrames that both contain 'EmployeeID'.
merged_df = pd.merge(df1, df2, on='EmployeeID')
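To make the join behavior concrete, here is a self-contained sketch with two small, hypothetical DataFrames. It contrasts the default inner join with a left join and shows vertical concatenation.

```python
import pandas as pd

# Hypothetical data: IDs 2 and 3 appear in both frames.
employees = pd.DataFrame({"EmployeeID": [1, 2, 3], "Name": ["Ada", "Bo", "Cy"]})
salaries = pd.DataFrame({"EmployeeID": [2, 3, 4], "Salary": [40, 95, 70]})

# Inner join (the default): only IDs present in both frames.
inner = pd.merge(employees, salaries, on="EmployeeID")

# Left join: keep every employee; unmatched salaries become NaN.
left = pd.merge(employees, salaries, on="EmployeeID", how="left")

# Vertical concatenation stacks rows; ignore_index renumbers them from 0.
new_hires = pd.DataFrame({"EmployeeID": [5], "Name": ["Di"]})
all_employees = pd.concat([employees, new_hires], ignore_index=True)
```

The `how` parameter also accepts `"right"` and `"outer"`, mirroring the corresponding SQL joins.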

## 7. Pivoting and Reshaping
Changing the structure and layout of data can provide fresh insights.

- Pivoting dataframes with `pivot()` or `pivot_table()`.
- Melting dataframes (turning columns into rows) using `melt()`.

In [None]:
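# Build a table with one row per date and one column per department.
# Each cell holds the mean of 'Sales' (the default aggregation of pivot_table).
pivoted = df.pivot_table(index='Date', columns='Department', values='Sales')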
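The relationship between `melt()`, `pivot()`, and `pivot_table()` is easiest to see as a round trip. The sketch below assumes a small wide table with one column per department; all names and values are illustrative.

```python
import pandas as pd

# Hypothetical wide table: one column per department.
wide = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02"],
    "Eng": [10, 12],
    "Ops": [7, 9],
})

# melt turns the department columns into rows (wide -> long).
long = wide.melt(id_vars="Date", var_name="Department", value_name="Sales")

# pivot reverses it (long -> wide); it requires unique Date/Department pairs.
back_to_wide = long.pivot(index="Date", columns="Department", values="Sales")

# pivot_table aggregates instead, so repeated pairs are allowed.
totals = long.pivot_table(index="Department", values="Sales", aggfunc="sum")
```

Use `pivot()` when each index/column pair occurs once and `pivot_table()` when values need to be aggregated.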

## 8. Working with Time Series
Pandas has robust features for working with time-series data.

- Time-based indexing and slicing.
- Resampling and frequency conversion using methods like `resample()` or `asfreq()`.
- Rolling window calculations with `rolling()`.

In [None]:
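# Convert a column to datetime and set it as the index.
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Resample by month and calculate the mean of the numeric columns.
# 'M' labels each bin by month end; pandas 2.2+ deprecates it in favor of 'ME'.
monthly_mean = df.resample('M').mean(numeric_only=True)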
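Time-based slicing, rolling windows, and `asfreq()` from the list above can be sketched on a small synthetic series. The dates and values are assumptions chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical daily sales for the first ten days of 2024 (values 0..9).
dates = pd.date_range("2024-01-01", periods=10, freq="D")
sales = pd.DataFrame({"Sales": range(10)}, index=dates)

# Time-based slicing: date strings select by label, and the end is inclusive.
first_three = sales.loc["2024-01-01":"2024-01-03"]

# Rolling window: mean over the current row and the two before it.
# The first two results are NaN because the window is not yet full.
rolling_mean = sales["Sales"].rolling(window=3).mean()

# Resample to weekly totals; each bin is labelled by the week's end (Sunday).
weekly_total = sales["Sales"].resample("W").sum()

# asfreq changes the frequency without aggregating: keep every second day.
every_other_day = sales.asfreq("2D")
```

`resample()` groups and aggregates, while `asfreq()` only selects (or inserts) rows at the new frequency.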

## 9. Exporting Data
After manipulation, you can easily export your data to various formats for further use or reporting.

- Using `to_csv`, `to_excel`, `to_json`, `to_sql` to export DataFrames to respective formats.

In [None]:
# Export the DataFrame to a new CSV file.
df.to_csv('processed_data.csv')
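A short sketch of two common export choices follows. It writes to an in-memory buffer so it runs without touching the file system; the DataFrame is hypothetical, and a file path can be passed in place of the buffer.

```python
import io

import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"Name": ["Ada", "Bo"], "Age": [36, 17]})

# index=False omits the 0, 1, ... row labels, which are rarely useful
# when the index is just the default row numbering.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# orient="records" produces a JSON list with one object per row.
json_text = df.to_json(orient="records")

print(csv_text)
print(json_text)
```

When the index carries real information, such as dates, leave `index=True` (the default) so it is written out as a column.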

Pandas covers the full data manipulation workflow, from loading and cleaning to reshaping and export, which sets the stage for in-depth analysis and modeling. It is a staple tool for data scientists and analysts.