
Pandas is a crucial Python library for data manipulation and analysis, frequently used in data science roles.  Here's a breakdown of key concepts for placement preparation:

**1. Core Data Structures:**

* **Series:** A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).  Think of it as a column in a table. Key aspects include indexing (label-based and integer-based) and data type inference.
* **DataFrame:** A two-dimensional labeled data structure with columns of potentially different types.  This is the workhorse of Pandas and analogous to a spreadsheet or SQL table.  Understanding how to create, manipulate, and query DataFrames is essential.
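
As a minimal sketch of how the two structures relate, the example below builds a Series with explicit labels and then a DataFrame around it. The names `prices`, `items`, and the sample values are made up for illustration.

```python
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index.
prices = pd.Series([1.5, 2.0, 0.75], index=["apple", "bread", "milk"])

by_label = prices.loc["bread"]    # label-based lookup
by_position = prices.iloc[1]      # integer-position lookup of the same element

# A DataFrame is a set of columns (each a Series) sharing one index.
items = pd.DataFrame({"price": prices, "qty": [4, 1, 2]})
price_column = items["price"]     # selecting one column returns a Series

print(by_label, by_position)
print(type(price_column).__name__)
print(items.dtypes)               # each column has its own inferred dtype
```

Here the DataFrame inherits its row labels from `prices`, which is why the plain list for `qty` must have the same length.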

**2. Data Input/Output:**

* Pandas excels at reading and writing data from various sources: CSV, Excel files, SQL databases, JSON, and more.  Be prepared to demonstrate proficiency in these operations, including specifying delimiters, handling headers, and managing missing data during import.
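* `pd.read_csv()`, `pd.read_excel()`, and `pd.read_sql()` are the common readers. Writing goes through DataFrame methods such as `df.to_csv()`; there is no top-level `pd.to_csv()`. Understand their main parameters (for example `sep`, `header`, `na_values`, and `index`).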
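
The import options mentioned above can be tried without a file on disk. This sketch reads from an in-memory string; the CSV text, the `;` delimiter, and the `"missing"` marker are invented for the example.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
raw = "name;score\nAsha;81\nBen;missing\nChen;77\n"

# sep sets the delimiter, header=0 takes column names from the first row,
# and na_values lists extra strings that should be read as missing.
scores = pd.read_csv(io.StringIO(raw), sep=";", header=0, na_values=["missing"])
print(scores)

# to_csv is a DataFrame method; index=False leaves the row labels out.
buffer = io.StringIO()
scores.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing a real path instead of `io.StringIO(...)` works the same way.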


**3. Data Selection and Filtering:**

* **Indexing and Selecting Data:**  `.loc` (label-based) and `.iloc` (integer-based) are fundamental for accessing specific rows and columns.  Practice different slicing techniques.
* **Boolean Indexing:**  Filtering data based on conditions is a very important skill.  Understand how to create boolean masks and apply them to DataFrames to select subsets of data.
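* **Conditional Selection:**  Use comparison operators (>, <, ==, !=) to build conditions and the element-wise logical operators (&, |, ~) to combine them. Wrap each condition in parentheses, because `&` and `|` bind more tightly than the comparisons.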
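
A short sketch of combining conditions, using a small table like the one in the code cell further down. The variable names `people` and `mask` are illustrative only.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 22, 28],
    "City": ["New York", "London", "Paris", "Tokyo"],
})

# Each condition gets its own parentheses before being combined with &.
mask = (people["Age"] > 24) & (people["City"] != "London")
selected = people[mask]

# ~ negates a mask; .loc pairs a row filter with a column selection.
rest = people.loc[~mask, ["Name", "Age"]]

print(selected)
print(rest)
```

Python's `and`, `or`, and `not` do not work element-wise on a Series and raise an error here, which is why the symbolic operators are used.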

**4. Data Cleaning and Preprocessing:**

* **Handling Missing Data:** `isnull()`, `notnull()`, `dropna()`, `fillna()`.  Learn how to identify, remove, or impute missing values appropriately.
* **Data Type Conversion:**  `astype()` is useful for converting column data types (e.g., string to numeric).
* **Duplicate Handling:** `duplicated()`, `drop_duplicates()`.  Identify and remove duplicate rows.
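
The cleaning steps above often appear together. The following is a sketch on made-up data; it swaps in `pd.to_numeric(..., errors="coerce")` for `astype()` because plain `astype(float)` raises on a non-numeric string such as `"n/a"`. Filling with the median is only one possible imputation choice.

```python
import pandas as pd

raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "amount": ["10", "20", "20", "n/a", "35"],
})

# Unparseable entries become NaN instead of raising an error.
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")

missing_per_column = raw.isna().sum()     # NaN count per column
duplicate_rows = raw.duplicated().sum()   # rows identical to an earlier row

# Drop exact duplicates, then impute the remaining gap with the median.
cleaned = raw.drop_duplicates().copy()
cleaned["amount"] = cleaned["amount"].fillna(cleaned["amount"].median())

print(missing_per_column)
print(duplicate_rows)
print(cleaned)
```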

**5. Data Wrangling and Transformation:**

* **Grouping and Aggregation:**  `groupby()` allows for grouping data based on one or more columns and applying aggregate functions (e.g., sum, mean, count, min, max).
* **Applying Functions:**  `.apply()` and `.map()` are powerful for applying custom functions to data.
* **Pivot Tables:**  Create summary tables for data analysis using pivot and crosstab operations.
* **Joining/Merging DataFrames:**  `merge()` combines DataFrames on shared columns (SQL-style joins), while `join()` combines them on their indexes by default.
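
To make the wrangling operations concrete, here is a small sketch that merges two hypothetical tables (`orders` and `customers`, both invented for this example), aggregates per group, and maps values through a dictionary.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 10, 12],
    "amount": [50, 20, 30, 40],
})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "segment": ["retail", "wholesale"],
})

# how="left" keeps every order; customer 12 has no match, so segment is NaN.
merged = orders.merge(customers, on="customer_id", how="left")

# Named aggregation: several statistics per group in one call.
summary = merged.groupby("customer_id").agg(
    total=("amount", "sum"),
    n_orders=("order_id", "count"),
)

# Series.map with a dict performs an element-wise lookup.
merged["segment_code"] = merged["segment"].map({"retail": "R", "wholesale": "W"})

print(merged)
print(summary)
```

Choosing `how="inner"` instead would drop the unmatched order rather than keep it with a missing segment.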

**6. Data Visualization (Integration with Matplotlib/Seaborn):**

* While not strictly Pandas, understanding how to visualize data using Pandas in conjunction with plotting libraries like Matplotlib and Seaborn is a significant advantage.  Know how to create histograms, scatter plots, bar charts, and more.


**7. Performance Considerations:**

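* Be aware of common performance bottlenecks with large datasets: row-wise `.apply()` and Python loops are typically much slower than vectorized column operations, and low-cardinality string columns often use less memory as the `category` dtype.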
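
Two common optimizations can be sketched on synthetic data: replacing a row-wise `apply` with column arithmetic, and storing a repetitive string column as a category. The column names and sizes are arbitrary, and actual speed and memory gains depend on the data and the pandas version.

```python
import numpy as np
import pandas as pd

n = 3_000
df = pd.DataFrame({
    "price": np.arange(n, dtype="float64"),
    "qty": np.full(n, 2, dtype="int64"),
    "region": ["North", "South", "East"] * (n // 3),
})

# Row-wise apply calls a Python function once per row.
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# The vectorized form does the same arithmetic on whole columns at once.
fast = df["price"] * df["qty"]

# Compare memory for the string column before and after conversion.
before = df["region"].memory_usage(deep=True)
after = df["region"].astype("category").memory_usage(deep=True)

print(before, after)
```

Both computations give the same values; wrapping each in `time.perf_counter()` calls is a simple way to compare them on your own machine.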


**Placement Tip:**  Practice using real-world datasets (Kaggle, UCI Machine Learning Repository) to build your portfolio and showcase your Pandas skills.  Be prepared to discuss your approach to data cleaning, analysis, and visualization in interviews.


In [2]:
import pandas as pd
import numpy as np

# 1. Series and DataFrames
# Create a Series
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print("Series:\n", s)

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
print("\nDataFrame:\n", df)


# 2. Data Input/Output
# Read a CSV file (replace with your file path)
# df_csv = pd.read_csv("your_file.csv")
# print("\nData from CSV:\n", df_csv.head())


# 3. Data Selection and Filtering
# .loc (label-based)
print("\n.loc:\n", df.loc[0:1, 'Name':'Age']) # select rows 0 to 1 (inclusive), columns 'Name' to 'Age'

# .iloc (integer-based)
print("\n.iloc:\n", df.iloc[0:2, 0:2]) # Select rows 0-1 and columns 0-1 by position (the end position 2 is excluded)

# Boolean Indexing
print("\nBoolean Indexing:\n", df[df['Age'] > 25])  # Filter rows where Age is greater than 25


# 4. Data Cleaning
# Handling missing data
df_with_nan = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
print("\nDataFrame with NaN:\n", df_with_nan)
print("\nDrop NaN rows:\n", df_with_nan.dropna())
print("\nFilling NaN with 0:\n", df_with_nan.fillna(0))

# Data Type Conversion
df['Age'] = df['Age'].astype(float) # Converting column to different type
print("\nModified data types:\n", df.dtypes)

# Duplicate Handling
df_duplicate = pd.DataFrame({'col1': [1, 2, 2, 3], 'col2': ['a', 'b', 'b', 'c']})
print("\nDataFrame with duplicates:\n", df_duplicate)
print("\nRemoved duplicates:\n", df_duplicate.drop_duplicates())


# 5. Data Wrangling and Transformation
# Grouping and Aggregation
print("\nGrouped data:\n", df.groupby('City')['Age'].mean())  # Calculate the mean Age per City

# Applying Functions
df['Age_squared'] = df['Age'].apply(lambda x: x**2)
print("\nAge squared:\n",df)


# 6. Visualization
# import matplotlib.pyplot as plt
# df.plot(x='Name', y='Age', kind='bar')
# plt.show()


# 7. Performance (Example with large data)
# For very large datasets, consider using optimized libraries like Dask.

# Placement Tip: Work with Kaggle datasets.


Series:
 0    10
1    20
2    30
3    40
4    50
dtype: int64

DataFrame:
       Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   22     Paris
3    David   28     Tokyo

.loc:
     Name  Age
0  Alice   25
1    Bob   30

.iloc:
     Name  Age
0  Alice   25
1    Bob   30

Boolean Indexing:
     Name  Age    City
1    Bob   30  London
3  David   28   Tokyo

DataFrame with NaN:
      A    B
0  1.0  4.0
1  2.0  NaN
2  NaN  6.0

Drop NaN rows:
      A    B
0  1.0  4.0

Filling NaN with 0:
      A    B
0  1.0  4.0
1  2.0  0.0
2  0.0  6.0

Modified data types:
 Name     object
Age     float64
City     object
dtype: object

DataFrame with duplicates:
    col1 col2
0     1    a
1     2    b
2     2    b
3     3    c

Removed duplicates:
    col1 col2
0     1    a
1     2    b
3     3    c

Grouped data:
 City
London      30.0
New York    25.0
Paris       22.0
Tokyo       28.0
Name: Age, dtype: float64

Age squared:
       Name   Age      City  Age_squared
0   

In [3]:
# Small sample project: working with NumPy and pandas on a sample dataset

import pandas as pd
import numpy as np

# Sample data (replace with your actual data)
data = {'Product': ['A', 'B', 'A', 'C', 'B', 'A'],
        'Sales': [100, 150, 120, 200, 180, 110],
        'Region': ['North', 'South', 'North', 'East', 'South', 'West']}
df = pd.DataFrame(data)

# 1. Basic Data Exploration
print("First 5 rows:\n", df.head())
print("\nData types:\n", df.dtypes)
print("\nSummary statistics:\n", df.describe())

# 2. Data Cleaning (Handling missing values - example)
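# Let's assume some 'Sales' values are missing. We'll replace them with the mean.
# Assign the result back to the column: the chained form
# df['Sales'].fillna(..., inplace=True) raises a FutureWarning and will not work in pandas 3.0.
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())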

# 3. Data Manipulation
# Calculate total sales per product
product_sales = df.groupby('Product')['Sales'].sum()
print("\nTotal Sales per product:\n", product_sales)

# 4. Data Filtering
# Find products with sales greater than 150
high_sales_products = df[df['Sales'] > 150]
print("\nProducts with sales > 150:\n", high_sales_products)

# 5. Data Transformation
# Create a new column 'SalesCategory'
def categorize_sales(sales):
    if sales < 150:
        return 'Low'
    elif sales < 200:
        return 'Medium'
    else:
        return 'High'

df['SalesCategory'] = df['Sales'].apply(categorize_sales)
print("\nDataFrame with SalesCategory:\n", df)

# 6. Using NumPy for calculations
# Calculate the mean sales using NumPy
sales_array = df['Sales'].to_numpy()  # Convert pandas Series to numpy array
mean_sales_numpy = np.mean(sales_array)
print("\nMean sales (NumPy):", mean_sales_numpy)


# 7. Pivot Tables (Example)
pivot_table = pd.pivot_table(df, values='Sales', index='Product', columns='Region', aggfunc='sum', fill_value=0)
print("\nPivot Table:\n", pivot_table)


First 5 rows:
   Product  Sales Region
0       A    100  North
1       B    150  South
2       A    120  North
3       C    200   East
4       B    180  South

Data types:
 Product    object
Sales       int64
Region     object
dtype: object

Summary statistics:
             Sales
count    6.000000
mean   143.333333
std     40.331956
min    100.000000
25%    112.500000
50%    135.000000
75%    172.500000
max    200.000000

Total Sales per product:
 Product
A    330
B    330
C    200
Name: Sales, dtype: int64

Products with sales > 150:
   Product  Sales Region
3       C    200   East
4       B    180  South

DataFrame with SalesCategory:
   Product  Sales Region SalesCategory
0       A    100  North           Low
1       B    150  South        Medium
2       A    120  North           Low
3       C    200   East          High
4       B    180  South        Medium
5       A    110   West           Low

Mean sales (NumPy): 143.33333333333334

Pivot Table:
 Region   East  North  South  West

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Sales'].fillna(df['Sales'].mean(), inplace=True)
