# Introduction to Pandas: Data Manipulation and Analysis Made Easy

## Introduction

**Pandas** is one of the most powerful and widely used Python libraries for data analysis and manipulation. It provides data structures like **DataFrames** and **Series**, which make it easy to handle structured data from sources such as CSV files, Excel spreadsheets, and SQL databases. In this topic, we'll explore the core functionality of Pandas for data analysis, including data loading, exploration, manipulation, and basic statistical analysis.

<b><i>Note:</i></b> Before starting, it is important to review the data and the columns available in the provided sales dataset. This initial exploration will give you a better understanding of the dataset's structure and contents, which is crucial before performing any data manipulation or analysis.
 


## 1. Installing Pandas
If you don’t have Pandas installed, open the Anaconda Command Prompt.

Type `pip install pandas` and press Enter.

Once installed, you can import Pandas into your Python environment like this:

In [44]:
import pandas as pd

## 2. Creating a DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types (like a table in a database or a spreadsheet such as Excel or Google Sheets). The most common way to create a DataFrame is by using a Python dictionary, where the keys represent column names and the values are lists containing the data for each column.

In [45]:
# Creating a dictionary of data
data = {
    'Name': ['Ali', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Occupation': ['Data Analyst', 'Data Scientist', 'Data Engineer']
}

# Creating the DataFrame
dataframe = pd.DataFrame(data)

# Displaying the DataFrame
dataframe


      Name  Age      Occupation
0      Ali   25    Data Analyst
1      Bob   30  Data Scientist
2  Charlie   35   Data Engineer
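
The introduction mentions Series alongside DataFrames. As a minimal sketch, the snippet below rebuilds the same data under the hypothetical name `people` (chosen here so it does not overwrite `dataframe`) and pulls one column out as a Series:

```python
import pandas as pd

# Same illustrative data as above, stored under a hypothetical variable name
people = pd.DataFrame({
    'Name': ['Ali', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Occupation': ['Data Analyst', 'Data Scientist', 'Data Engineer']
})

# Selecting a single column returns a Series
ages = people['Age']
print(type(ages))
print(ages)

# A Series supports the same kind of summary methods as a DataFrame column
print(ages.mean())  # 30.0
```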


### Try it Yourself: Creating a DataFrame
Create a DataFrame called "<b>Quarter_Sales</b>" that stores the following business data:
- `Salesperson`: Daniel, Richard, Chen, Shahrukh 
- `Region`: North, South, East, West
- `Q1 Sales`: 12000, 10000, 15000, 13000

In [46]:
#Write your answer here... 

## 3. Loading Data from a CSV File
Pandas can load data from various sources such as CSV files, Excel files, and SQL databases. The most common format is CSV, and the function to read CSV files is `pd.read_csv(file_path)`.

In [47]:
# Load a CSV file
df = pd.read_csv("C:/Users/Admin/Desktop/Datapre8/Code Institute/Module  - Getting Started with Python Programming/Sales_Data.csv")
# Change the path to point to your own file, and use forward slashes '/' in the path instead of the default backslashes '\'

# Print the Dataset
print(df)


   Product_ID Product_Name  Sales  Profit Region
0         101       Laptop   1200     200  North
1         102   Smartphone    800     100  South
2         103       Tablet   1500     250   East
3         104   Headphones    300      50   West
4         105       Camera    700      80  North
5         106      Monitor    500      60  South
6         107     Keyboard    400      40   West
7         108        Mouse    600      70  North
8         109      Charger    450      55   East
9         110       Router    850     110  South


#### Explanation:
- `pd.read_csv()`: Reads a CSV file into a Pandas DataFrame.
- `print(df)`: Prints the whole DataFrame. For larger datasets, `df.head()` (covered in the next section) shows only the first 5 rows.
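
If you do not have the CSV file to hand, the following sketch shows the same call working on in-memory text. The `csv_text` string and the `sample_df` name are assumptions made for illustration; they stand in for the first rows of `Sales_Data.csv`.

```python
import io
import pandas as pd

# A few rows of CSV text standing in for Sales_Data.csv (illustrative only)
csv_text = """Product_ID,Product_Name,Sales,Profit,Region
101,Laptop,1200,200,North
102,Smartphone,800,100,South
103,Tablet,1500,250,East
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```

The first line of the text becomes the column names, and each following line becomes one row.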

## 4. Exploring the Data

Once the data is loaded into a Pandas DataFrame, it is essential to explore its structure and contents to understand the dataset better. Pandas offers several useful methods for exploring the data.

### a) Viewing the First Few Rows

You can use the `head()` method to view the first few rows of the DataFrame, which provides a quick preview of the data.

```python
# Display the first 5 rows of the DataFrame
print(df.head())
```


In [48]:
print(df.head())

   Product_ID Product_Name  Sales  Profit Region
0         101       Laptop   1200     200  North
1         102   Smartphone    800     100  South
2         103       Tablet   1500     250   East
3         104   Headphones    300      50   West
4         105       Camera    700      80  North
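
`head()` also accepts the number of rows to show, and `tail()` does the same from the end of the DataFrame. A short sketch, assuming a hypothetical `sales_df` rebuilt inline from the sales data so the snippet runs without the CSV file:

```python
import pandas as pd

# Inline copy of two columns of the sales data (hypothetical variable name)
sales_df = pd.DataFrame({
    'Product_Name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Camera',
                     'Monitor', 'Keyboard', 'Mouse', 'Charger', 'Router'],
    'Sales': [1200, 800, 1500, 300, 700, 500, 400, 600, 450, 850],
})

# First three rows instead of the default five
first_three = sales_df.head(3)
print(first_three)

# Last two rows
last_two = sales_df.tail(2)
print(last_two)
```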


### b) Shape of the Data
The `df.shape` attribute provides the number of rows and columns in the DataFrame.



In [49]:
# Check the shape of the DataFrame (rows, columns)
print(df.shape)

(10, 5)


### c) Getting Information about the DataFrame
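The `info()` method gives a summary of the DataFrame, including the column names, data types, and the number of non-null values in each column. It prints this summary itself and returns `None`, so wrapping it in `print()` adds a trailing `None` to the output.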

In [50]:
# Get a summary of the DataFrame's structure
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Product_ID    10 non-null     int64 
 1   Product_Name  10 non-null     object
 2   Sales         10 non-null     int64 
 3   Profit        10 non-null     int64 
 4   Region        10 non-null     object
dtypes: int64(3), object(2)
memory usage: 528.0+ bytes
None


### d) Descriptive Statistics
The `describe()` method generates descriptive statistics for numerical columns, such as the mean, standard deviation, minimum, and maximum values.

In [51]:
# View statistical summary of numerical columns
print(df.describe())

       Product_ID      Sales      Profit
count    10.00000    10.0000   10.000000
mean    105.50000   730.0000  101.500000
std       3.02765   376.5339   69.604039
min     101.00000   300.0000   40.000000
25%     103.25000   462.5000   56.250000
50%     105.50000   650.0000   75.000000
75%     107.75000   837.5000  107.500000
max     110.00000  1500.0000  250.000000
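
Individual statistics can also be computed directly on a column. By default `describe()` summarises numeric columns only, so for a text column such as `Region` a count of each value is a useful companion. The sketch below assumes a hypothetical `sales_df` rebuilt inline from the sales data:

```python
import pandas as pd

# Inline copy of two columns of the sales data (hypothetical variable name)
sales_df = pd.DataFrame({
    'Sales': [1200, 800, 1500, 300, 700, 500, 400, 600, 450, 850],
    'Region': ['North', 'South', 'East', 'West', 'North',
               'South', 'West', 'North', 'East', 'South'],
})

# Single statistics on one numeric column
print(sales_df['Sales'].mean())    # 730.0
print(sales_df['Sales'].median())  # 650.0

# How many rows fall in each region
region_counts = sales_df['Region'].value_counts()
print(region_counts)
```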


## 5. Selecting Data
You can select specific columns, rows, or subsets of data from the DataFrame using labels or positions.

### a) Selecting Single Column

In [52]:
# Select a single column
df['Sales']

0    1200
1     800
2    1500
3     300
4     700
5     500
6     400
7     600
8     450
9     850
Name: Sales, dtype: int64

#### Explanation:
- `df['Sales']`: Selects the "Sales" column.

### Try it Yourself: Selecting Single Column
- Select the column 'Product_Name'.

In [53]:
#Write your answer here...

### b) Selecting Multiple Columns

In [54]:
# Select multiple columns
df[['Sales', 'Profit']]

   Sales  Profit
0   1200     200
1    800     100
2   1500     250
3    300      50
4    700      80
5    500      60
6    400      40
7    600      70
8    450      55
9    850     110


#### Explanation:
- `df[['Sales', 'Profit']]`: Selects multiple columns. Note the double square brackets: the inner pair is a Python list of column names.

### Try it Yourself: Selecting Multiple Columns
- Select the columns 'Sales' and 'Region'.

In [55]:
#Write your answer here...

### c) Selecting Rows Using .loc[] and .iloc[]
- `.loc[]`: Select rows and columns by labels.
- `.iloc[]`: Select rows and columns by integer positions.

### Selecting Rows using `.loc[]`

In [56]:
# Select rows by index label (e.g., row with index 2)
row = df.loc[2]
print(row)

Product_ID         103
Product_Name    Tablet
Sales             1500
Profit             250
Region            East
Name: 2, dtype: object


#### Explanation:
- `df.loc[]`: Accesses rows/columns by labels (index and column names).

### Try it Yourself: Selecting Rows using .loc[]
- Select the row with index 4.

In [57]:
#Write your answer here... 


### Selecting Rows using `.iloc[]`

In [58]:
# Select rows by position (e.g., first three rows)
subset_rows = df.iloc[:3]
print(subset_rows)

   Product_ID Product_Name  Sales  Profit Region
0         101       Laptop   1200     200  North
1         102   Smartphone    800     100  South
2         103       Tablet   1500     250   East


#### Explanation:
- `df.iloc[]`: Accesses rows/columns by integer position (index numbers).
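
Both accessors can select rows and columns in one step. The sketch below, which assumes a hypothetical `sales_df` rebuilt inline from the sales data, also shows one difference worth knowing: a `.loc[]` slice includes its end label, while an `.iloc[]` slice excludes its end position.

```python
import pandas as pd

# Inline copy of the sales data (hypothetical variable name)
sales_df = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'Product_Name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Camera',
                     'Monitor', 'Keyboard', 'Mouse', 'Charger', 'Router'],
    'Sales': [1200, 800, 1500, 300, 700, 500, 400, 600, 450, 850],
    'Profit': [200, 100, 250, 50, 80, 60, 40, 70, 55, 110],
    'Region': ['North', 'South', 'East', 'West', 'North',
               'South', 'West', 'North', 'East', 'South'],
})

# .loc: row labels 0 through 2 (end label included) and two named columns
by_label = sales_df.loc[0:2, ['Product_Name', 'Sales']]

# .iloc: row positions 0, 1, 2 (end position excluded) and column positions 1 and 2
by_position = sales_df.iloc[0:3, [1, 2]]

print(by_label)
print(by_label.equals(by_position))  # True

# A single value: row label 2, column 'Sales'
print(sales_df.loc[2, 'Sales'])  # 1500
```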

### Try it Yourself: Selecting Rows using .iloc[]
- Select the first 6 rows.

In [59]:
#Write your answer here...


## 6. Data Filtering and Conditional Selection
You can filter the data based on conditions using boolean indexing.

### a) Filtering Using Comparison Operator

In [60]:
# Filter rows where Sales > 500
high_sales = df[df['Sales'] > 500]
print(high_sales)

   Product_ID Product_Name  Sales  Profit Region
0         101       Laptop   1200     200  North
1         102   Smartphone    800     100  South
2         103       Tablet   1500     250   East
4         105       Camera    700      80  North
7         108        Mouse    600      70  North
9         110       Router    850     110  South


#### Explanation:
- `df['Sales'] > 500`: Creates a boolean Series that is `True` for each row where Sales is greater than 500 and `False` otherwise. Passing this Series to `df[...]` keeps only the `True` rows.
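
To make the two steps visible, the sketch below stores the boolean Series in a variable (called `mask` here, a name chosen for illustration) before using it. It assumes a hypothetical `sales_df` rebuilt inline from the sales data.

```python
import pandas as pd

# Inline copy of two columns of the sales data (hypothetical variable name)
sales_df = pd.DataFrame({
    'Product_Name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Camera',
                     'Monitor', 'Keyboard', 'Mouse', 'Charger', 'Router'],
    'Sales': [1200, 800, 1500, 300, 700, 500, 400, 600, 450, 850],
})

# Step 1: the comparison produces one True/False value per row
mask = sales_df['Sales'] > 500
print(mask)

# True counts as 1, so sum() gives the number of matching rows
print(mask.sum())  # 6

# Step 2: indexing with the mask keeps only the True rows
high_sales = sales_df[mask]
print(high_sales)
```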

### Try it Yourself: Data Filtering using Comparison Operator
- Filter rows where Sales is greater than 1000.

In [61]:
#Write your answer here...


### b) Data Filtering including a Logical Operator
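Use the `&` (and) and `|` (or) operators to combine conditions. Wrap each condition in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>`.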

In [62]:
# Filter rows where both Sales > 500 and Profit > 100
high_sales_profit = df[(df['Sales'] > 500) & (df['Profit'] > 100)]
print(high_sales_profit)

   Product_ID Product_Name  Sales  Profit Region
0         101       Laptop   1200     200  North
2         103       Tablet   1500     250   East
9         110       Router    850     110  South
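
The same pattern works with `|`, and a condition can be inverted with `~` (not). A brief sketch, assuming a hypothetical `sales_df` rebuilt inline from the sales data:

```python
import pandas as pd

# Inline copy of three columns of the sales data (hypothetical variable name)
sales_df = pd.DataFrame({
    'Product_Name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Camera',
                     'Monitor', 'Keyboard', 'Mouse', 'Charger', 'Router'],
    'Sales': [1200, 800, 1500, 300, 700, 500, 400, 600, 450, 850],
    'Region': ['North', 'South', 'East', 'West', 'North',
               'South', 'West', 'North', 'East', 'South'],
})

# Rows where Sales > 1000 OR the region is West
either = sales_df[(sales_df['Sales'] > 1000) | (sales_df['Region'] == 'West')]
print(either)

# Rows where the region is NOT North
not_north = sales_df[~(sales_df['Region'] == 'North')]
print(not_north)
```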


### Try it Yourself: Data Filtering including a Logical Operator
- Filter rows where Sales > 1000 or Profit < 300

In [63]:
#Write your answer here...


## 7. Data Manipulation

### a) Adding New Columns
You can easily add new columns to a DataFrame.

In [64]:
# Add a new column called 'Cost'
df['Cost'] = df['Sales'] - df['Profit']
print(df)

   Product_ID Product_Name  Sales  Profit Region  Cost
0         101       Laptop   1200     200  North  1000
1         102   Smartphone    800     100  South   700
2         103       Tablet   1500     250   East  1250
3         104   Headphones    300      50   West   250
4         105       Camera    700      80  North   620
5         106      Monitor    500      60  South   440
6         107     Keyboard    400      40   West   360
7         108        Mouse    600      70  North   530
8         109      Charger    450      55   East   395
9         110       Router    850     110  South   740
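
Assigning to `df['Cost']` changes `df` in place. As an alternative sketch, the `assign()` method returns a new DataFrame with the extra column and leaves the original untouched. The `sales_df` and `with_cost` names are assumptions made for illustration.

```python
import pandas as pd

# Inline copy of three columns of the sales data (hypothetical variable name)
sales_df = pd.DataFrame({
    'Product_Name': ['Laptop', 'Smartphone', 'Tablet'],
    'Sales': [1200, 800, 1500],
    'Profit': [200, 100, 250],
})

# assign() builds a new DataFrame that includes the 'Cost' column
with_cost = sales_df.assign(Cost=sales_df['Sales'] - sales_df['Profit'])
print(with_cost)

# The original DataFrame is unchanged
print(list(sales_df.columns))  # ['Product_Name', 'Sales', 'Profit']
```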


### Try it Yourself: Adding New Columns
- Add a new column called profit(%). 
- Formula = (Profit / Sales) * 100

In [65]:
#Write your answer here...


### b) Modifying Existing Columns

In [66]:
# Modify the 'Sales' column by applying a 10% discount
df['Sales'] = df['Sales'] * 0.9
print(df)

   Product_ID Product_Name   Sales  Profit Region  Cost
0         101       Laptop  1080.0     200  North  1000
1         102   Smartphone   720.0     100  South   700
2         103       Tablet  1350.0     250   East  1250
3         104   Headphones   270.0      50   West   250
4         105       Camera   630.0      80  North   620
5         106      Monitor   450.0      60  South   440
6         107     Keyboard   360.0      40   West   360
7         108        Mouse   540.0      70  North   530
8         109      Charger   405.0      55   East   395
9         110       Router   765.0     110  South   740


### Try it Yourself: Modifying Existing Columns
- Add 100 as an additional cost for each product. 

In [67]:
#Write your answer here... 

### c) Removing Columns

To remove one or more columns in Pandas, you can use the `drop()` method. It can drop either rows or columns; the `columns` argument specifies which columns to remove. By default, `drop()` returns a new DataFrame and leaves the original unchanged.

#### Syntax:
```python
df.drop(columns=['column_name1', 'column_name2'])
```


In [68]:
# Drop the 'Cost' column from the DataFrame
df_without_cost = df.drop(columns=['Cost'])

# Display the DataFrame without the 'Cost' column
print(df_without_cost.head())


   Product_ID Product_Name   Sales  Profit Region
0         101       Laptop  1080.0     200  North
1         102   Smartphone   720.0     100  South
2         103       Tablet  1350.0     250   East
3         104   Headphones   270.0      50   West
4         105       Camera   630.0      80  North
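
Because `drop()` returns a new DataFrame, the original keeps its columns unless you reassign the result. The sketch below, on a small hypothetical `sales_df`, drops two columns at once and shows the equivalent `axis=1` spelling:

```python
import pandas as pd

# Small illustrative frame (hypothetical variable name)
sales_df = pd.DataFrame({
    'Product_Name': ['Laptop', 'Smartphone', 'Tablet'],
    'Sales': [1200, 800, 1500],
    'Profit': [200, 100, 250],
    'Cost': [1000, 700, 1250],
})

# Drop two columns in one call
trimmed = sales_df.drop(columns=['Profit', 'Cost'])
print(list(trimmed.columns))   # ['Product_Name', 'Sales']

# The original still has all four columns
print(list(sales_df.columns))  # ['Product_Name', 'Sales', 'Profit', 'Cost']

# Equivalent form: axis=1 means "look for these labels among the columns"
same_result = sales_df.drop(['Profit', 'Cost'], axis=1)
print(trimmed.equals(same_result))  # True
```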


### Try it Yourself: Removing a Column
- Remove the column Product_ID

In [69]:
#Write your answer here... 


### d) Renaming Columns in Pandas

In Pandas, you can rename one or more columns using the `rename()` method. This is particularly useful for improving the readability of your DataFrame or aligning column names with specific naming conventions.

#### Syntax:
```python
df.rename(columns={'old_column_name': 'new_column_name'})
```


In [70]:
# Rename the 'Sales' column to 'Total_Sales'
df_renamed = df.rename(columns={'Sales': 'Total_Sales'})

# Display the updated DataFrame
print(df_renamed.head())


   Product_ID Product_Name  Total_Sales  Profit Region  Cost
0         101       Laptop       1080.0     200  North  1000
1         102   Smartphone        720.0     100  South   700
2         103       Tablet       1350.0     250   East  1250
3         104   Headphones        270.0      50   West   250
4         105       Camera        630.0      80  North   620


### e) Renaming Multiple Columns
You can rename multiple columns at once by passing a dictionary of old and new column names.

In [71]:
# Rename multiple columns
df_renamed_multiple = df.rename(columns={'Sales': 'Total_Sales', 'Profit': 'Net_Profit'})

# Display the updated DataFrame
print(df_renamed_multiple.head())


   Product_ID Product_Name  Total_Sales  Net_Profit Region  Cost
0         101       Laptop       1080.0         200  North  1000
1         102   Smartphone        720.0         100  South   700
2         103       Tablet       1350.0         250   East  1250
3         104   Headphones        270.0          50   West   250
4         105       Camera        630.0          80  North   620
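
Besides a dictionary, `rename()` also accepts a function that is applied to every column name, which can be handy for enforcing a naming convention. A minimal sketch on a small hypothetical `sales_df`:

```python
import pandas as pd

# Small illustrative frame (hypothetical variable name)
sales_df = pd.DataFrame({
    'Product_ID': [101, 102],
    'Product_Name': ['Laptop', 'Smartphone'],
    'Sales': [1200, 800],
})

# Apply str.lower to every column name
lowered = sales_df.rename(columns=str.lower)
print(list(lowered.columns))   # ['product_id', 'product_name', 'sales']

# Like drop(), rename() returns a new DataFrame; the original is unchanged
print(list(sales_df.columns))  # ['Product_ID', 'Product_Name', 'Sales']
```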


### Try it Yourself: Renaming the Column
- In `df_renamed_multiple`, rename the column `Net_Profit` back to `Profit`.

In [72]:
#Write your answer here...


## 8. Grouping and Aggregating Data

Pandas allows you to group data based on certain criteria and then apply aggregation functions (like `sum()`, `mean()`, or `count()`) to summarise the data. This is extremely useful when working with large datasets and when you want to perform group-level operations.

### Common Aggregation Functions
There are some common aggregation functions that can be applied to grouped data:

- `sum()`: Calculate the total sum of the values.
- `mean()`: Calculate the mean (average) of the values.
- `count()`: Count the number of non-null observations.
- `min()` and `max()`: Find the minimum and maximum values.
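
Before applying these functions to groups, it may help to see them on a whole column. The sketch below uses the original (undiscounted) Sales values in a hypothetical `sales_df` rebuilt inline:

```python
import pandas as pd

# Original Sales values from the dataset (hypothetical variable name)
sales_df = pd.DataFrame({
    'Sales': [1200, 800, 1500, 300, 700, 500, 400, 600, 450, 850],
})

sales = sales_df['Sales']
print(sales.sum())    # 7300
print(sales.mean())   # 730.0
print(sales.count())  # 10
print(sales.min())    # 300
print(sales.max())    # 1500
```

With `groupby()`, the same functions are computed once per group instead of once for the whole column.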

### a) Grouping Data

You can group data based on one or more columns using the `groupby()` method. This is often followed by an aggregation operation.

In [73]:
# Group data by 'Region' and calculate total sales for each region
region_sales = df.groupby('Region')['Sales'].sum()

print(region_sales)


Region
East     1755.0
North    2250.0
South    1935.0
West      630.0
Name: Sales, dtype: float64


#### Explanation:
- `df.groupby('Region')`: Groups the data by the 'Region' column.
- `['Sales'].sum()`: Calculates the sum of the 'Sales' column for each group (region).

### Try it Yourself: Grouping Data
- Group the data by 'Region' and calculate the total profit for each region

In [74]:
#Write your answer here...

### b) Grouping by Multiple Columns
You can also group by multiple columns to perform more complex grouping operations.

In [75]:
# Group by 'Region' and 'Product_Name' and calculate the sum of sales
region_product_sales = df.groupby(['Region', 'Product_Name'])['Sales'].sum()
print(region_product_sales)

Region  Product_Name
East    Charger          405.0
        Tablet          1350.0
North   Camera           630.0
        Laptop          1080.0
        Mouse            540.0
South   Monitor          450.0
        Router           765.0
        Smartphone       720.0
West    Headphones       270.0
        Keyboard         360.0
Name: Sales, dtype: float64


#### Explanation:
- `df.groupby(['Region', 'Product_Name'])`: Groups the data first by 'Region' and then by 'Product_Name'.
- `['Sales'].sum()`: Calculates the total sales for each combination of region and product.

### Try it Yourself: Grouping by Multiple Columns
- Group the data by 'Region' and 'Product_Name' and calculate the total profit for each combination of region and product

In [76]:
#Write your answer here... 


### c) Applying Multiple Aggregation Functions

In [77]:
# Group by 'Region' and apply multiple aggregation functions to the 'Sales' column
aggregated_sales = df.groupby('Region')['Sales'].agg(['sum', 'mean', 'count'])
print(aggregated_sales)

           sum   mean  count
Region                      
East    1755.0  877.5      2
North   2250.0  750.0      3
South   1935.0  645.0      3
West     630.0  315.0      2


#### Explanation:
- `agg(['sum', 'mean', 'count'])`: Applies multiple aggregation functions (sum, mean, and count) to the grouped data.
- This generates a summary of total sales, average sales, and the number of entries for each region.
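
When the summary should draw on more than one column, `agg()` also supports named aggregation, where each output column is given as `name=(column, function)`. The sketch below uses the original (undiscounted) values in a hypothetical `sales_df`; the output column names are chosen purely for illustration.

```python
import pandas as pd

# Inline copy of the sales data with original values (hypothetical variable name)
sales_df = pd.DataFrame({
    'Product_ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'Sales': [1200, 800, 1500, 300, 700, 500, 400, 600, 450, 850],
    'Profit': [200, 100, 250, 50, 80, 60, 40, 70, 55, 110],
    'Region': ['North', 'South', 'East', 'West', 'North',
               'South', 'West', 'North', 'East', 'South'],
})

# One row per region, with a differently aggregated column for each keyword
summary = sales_df.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    average_profit=('Profit', 'mean'),
    products=('Product_ID', 'count'),
)
print(summary)
```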

### Try it Yourself: Common Aggregation Functions
- Group by 'Region' and apply multiple aggregation functions to the 'Profit' column

In [78]:
#Write your answer here... 

## 9. Writing the File
Once your DataFrame is ready, you can write it to a CSV file using the `to_csv()` method. Here’s how:

#### Syntax 

```python
df.to_csv('filepath/filename.csv', index=False)
```


In [79]:
# Write DataFrame to a CSV file
df.to_csv('C:/Users/Admin/Desktop/Datapre8/Code Institute/Module  - Getting Started with Python Programming/my_data.csv', index=False)


#### Explanation: 
- The `index=False` argument is optional, but it prevents the DataFrame index from being written into the CSV file.
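
To see what `index=False` changes, the sketch below writes a small hypothetical DataFrame to a temporary folder twice, once with and once without the index, and reads both files back. The file names and the `small_df` variable are assumptions made for illustration.

```python
import os
import tempfile
import pandas as pd

# Small illustrative frame (hypothetical variable name)
small_df = pd.DataFrame({
    'Product_Name': ['Laptop', 'Smartphone', 'Tablet'],
    'Sales': [1200, 800, 1500],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # index=False: only the data columns are written
    clean_path = os.path.join(tmp_dir, 'no_index.csv')
    small_df.to_csv(clean_path, index=False)
    clean = pd.read_csv(clean_path)

    # Default (index=True): the index is written as an extra, unnamed column
    indexed_path = os.path.join(tmp_dir, 'with_index.csv')
    small_df.to_csv(indexed_path)
    indexed = pd.read_csv(indexed_path)

print(list(clean.columns))    # ['Product_Name', 'Sales']
print(list(indexed.columns))  # ['Unnamed: 0', 'Product_Name', 'Sales']
```

An unexpected `Unnamed: 0` column after loading a CSV usually means the file was saved with its index.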

# What's Next? 

Next, we’ll dive into data visualization in Python, where you’ll learn to create clear and insightful visuals using libraries like Seaborn. Let’s explore how to bring data to life!