1. Getting Familiar with Pandas

In [1]:
import pandas as pd

In [2]:
# Creating a Series from a list
series = pd.Series([10, 20, 30, 40, 50])
print("Series from list:")
print(series)


Series from list:
0    10
1    20
2    30
3    40
4    50
dtype: int64
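
The Series above uses the default integer index (0 to 4). As a minimal sketch, assuming you want named labels instead, the index can be set explicitly. The labels 'a', 'b', 'c' here are made up for illustration.

```python
import pandas as pd

# Hypothetical labels; the example above relies on the default 0..4 index
labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(labeled['b'])     # look up a value by its label
print(labeled.iloc[0])  # look up a value by its position
```

Square brackets with a label and .iloc with a position are two different lookups, which matters once the index is no longer 0, 1, 2, ...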


In [3]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['sudha', 'vardhan', 'sasank', 'sai'],
    'Age': [25, 32, 23, 45],
    'City': ['New York', 'india', 'Chicago', 'UK']
}
df = pd.DataFrame(data)
print("\nDataFrame from dictionary:")
print(df)



DataFrame from dictionary:
      Name  Age      City
0    sudha   25  New York
1  vardhan   32     india
2   sasank   23   Chicago
3      sai   45        UK


Operations on DataFrames:

Selecting data (single column and multiple columns):

In [4]:
# Selecting a single column returns a Series
age_column = df['Age']
print("\nSelected column (Age):")
print(age_column)

# Selecting multiple columns (note the list inside the brackets) returns a DataFrame
subset = df[['Name', 'Age']]
print("\nSubset of DataFrame:\n", subset)


Selected column (Age):
0    25
1    32
2    23
3    45
Name: Age, dtype: int64

Subset of DataFrame:
       Name  Age
0    sudha   25
1  vardhan   32
2   sasank   23
3      sai   45


Filtering data:

In [5]:
# Keep only the rows where the condition is True
filtered_df = df[df['Age'] > 30]
print("\nFiltered DataFrame (Age > 30):")
print(filtered_df)


Filtered DataFrame (Age > 30):
      Name  Age   City
1  vardhan   32  india
3      sai   45     UK
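
The filter above uses a single condition. A possible extension, sketched here on the same data, combines several conditions. The specific thresholds and city list are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['sudha', 'vardhan', 'sasank', 'sai'],
    'Age': [25, 32, 23, 45],
    'City': ['New York', 'india', 'Chicago', 'UK']
})

# Wrap each condition in parentheses; combine with & (and), | (or), ~ (not)
in_range = df[(df['Age'] > 24) & (df['Age'] < 40)]

# isin() keeps rows whose value appears in the given list
in_cities = df[df['City'].isin(['Chicago', 'UK'])]

print(in_range)
print(in_cities)
```

Python's plain `and` / `or` keywords do not work on whole columns, which is why the element-wise operators are used here.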


Modifying data:

In [6]:
df.loc[0, 'Age'] = 35  # Update a single value
print("\nDataFrame after modifying age of the first entry:\n", df)


DataFrame after modifying age of the first entry:
       Name  Age      City
0    sudha   35  New York
1  vardhan   32     india
2   sasank   23   Chicago
3      sai   45        UK
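
The same .loc pattern also accepts a boolean condition instead of a row label, so several rows can be updated at once. The sketch below assumes the DataFrame as it stands after the update above. The 'IsSenior' column and the cutoff of 40 are hypothetical, added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['sudha', 'vardhan', 'sasank', 'sai'],
    'Age': [35, 32, 23, 45],
    'City': ['New York', 'india', 'Chicago', 'UK']
})

# Update every row that matches a condition
df.loc[df['City'] == 'india', 'City'] = 'India'

# Add a new column derived from an existing one (hypothetical column name)
df['IsSenior'] = df['Age'] >= 40

print(df)
```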


2. Data Handling with Pandas

In [7]:
# Create a DataFrame with missing values
data_with_missing = {
    'Product': ['A', 'B', 'C', None],
    'Price': [100, 200, None, 400]
}
df_missing = pd.DataFrame(data_with_missing)
print("\nDataFrame with missing values:")
print(df_missing)


DataFrame with missing values:
  Product  Price
0       A  100.0
1       B  200.0
2       C    NaN
3    None  400.0


Handling missing data:

In [8]:
# Fill every missing value with 0 (this also puts 0 in the text column 'Product')
df_filled = df_missing.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_filled)


DataFrame after filling missing values with 0:
  Product  Price
0       A  100.0
1       B  200.0
2       C    0.0
3       0  400.0
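
Filling everything with 0 puts a number into the text column 'Product'. One alternative, shown as a sketch, is to pass a dictionary so each column gets its own fill value, or to drop incomplete rows with dropna(). The placeholder 'Unknown' and the choice of the column mean are assumptions made for illustration, not a recommendation for every dataset.

```python
import pandas as pd

df_missing = pd.DataFrame({
    'Product': ['A', 'B', 'C', None],
    'Price': [100, 200, None, 400]
})

# A different fill value per column: a placeholder for text, the mean for numbers
filled = df_missing.fillna({
    'Product': 'Unknown',
    'Price': df_missing['Price'].mean()
})

# Alternatively, drop every row that has at least one missing value
dropped = df_missing.dropna()

print(filled)
print(dropped)
```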


Checking for missing data:

In [11]:
print("\nDataFrame with missing values:\n", df_missing)
print("\nChecking for missing data:\n", df_missing.isnull())


DataFrame with missing values:
   Product  Price
0       A  100.0
1       B  200.0
2       C    NaN
3    None  400.0

Checking for missing data:
    Product  Price
0    False  False
1    False  False
2    False   True
3     True  False
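
On a larger DataFrame the full True/False table is hard to read. A common follow-up, sketched here on the same data, is to count the missing values per column and to pull out only the affected rows.

```python
import pandas as pd

df_missing = pd.DataFrame({
    'Product': ['A', 'B', 'C', None],
    'Price': [100, 200, None, 400]
})

# True counts as 1, so summing the mask gives the number of missing values per column
missing_counts = df_missing.isnull().sum()

# any(axis=1) flags rows that have a missing value in at least one column
rows_with_missing = df_missing[df_missing.isnull().any(axis=1)]

print(missing_counts)
print(rows_with_missing)
```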


Conversion of data types:

In [10]:
# Convert the 'Age' column from int64 to float64
df['Age'] = df['Age'].astype(float)
print("\nDataFrame with 'Age' converted to float:")
print(df)


DataFrame with 'Age' converted to float:
      Name   Age      City
0    sudha  35.0  New York
1  vardhan  32.0     india
2   sasank  23.0   Chicago
3      sai  45.0        UK
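
astype() raises an error if any value cannot be converted. When a column arrives as text with some bad entries, pd.to_numeric with errors='coerce' is one way to convert it. The raw values below are invented for illustration.

```python
import pandas as pd

# Hypothetical raw input: numbers stored as text, with one unusable entry
raw_ages = pd.Series(['25', '32', 'unknown', '45'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
ages = pd.to_numeric(raw_ages, errors='coerce')

print(ages)
```

The unparseable entry becomes NaN, so it can then be handled with the missing-data tools shown earlier.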


3. Data Analysis with Pandas

In [12]:
print("\nSummary statistics of the DataFrame:")
print(df.describe())


Summary statistics of the DataFrame:
             Age
count   4.000000
mean   33.750000
std     9.069179
min    23.000000
25%    29.750000
50%    33.500000
75%    37.500000
max    45.000000


Grouping Data and Applying Aggregate Functions:

In [13]:
# Group the rows by 'City' and compute the mean 'Age' within each group
grouped_df = df.groupby('City').agg({'Age': 'mean'})
print("\nGrouped data by City with mean Age:")
print(grouped_df)


Grouped data by City with mean Age:
           Age
City          
Chicago   23.0
New York  35.0
UK        45.0
india     32.0
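
In the example above every city has exactly one row, so each mean is just that person's age. The sketch below uses made-up data with repeated cities to show several aggregates computed in one call. The output column names (mean_age, oldest, n_people) are arbitrary.

```python
import pandas as pd

# Hypothetical data with more than one person per city
people = pd.DataFrame({
    'City': ['Chicago', 'Chicago', 'UK', 'UK', 'UK'],
    'Age': [23, 31, 45, 39, 27]
})

# Named aggregation: new_column=(source_column, function)
summary = people.groupby('City').agg(
    mean_age=('Age', 'mean'),
    oldest=('Age', 'max'),
    n_people=('Age', 'count')
)

print(summary)
```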


Merging two DataFrames:

In [14]:
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [70000, 80000, 90000]})
# An outer join keeps every ID from both frames; cells with no match become NaN
merged_df = pd.merge(df1, df2, on='ID', how='outer')
print("\nMerged DataFrame:")
print(merged_df)


Merged DataFrame:
   ID     Name   Salary
0   1    Alice      NaN
1   2      Bob  70000.0
2   3  Charlie  80000.0
3   4      NaN  90000.0
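
The how argument controls which rows survive the merge. As a short sketch on the same two frames, the code below compares 'inner' and 'left' with the 'outer' join above, and uses indicator=True to record where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [70000, 80000, 90000]})

# inner: only IDs present in both frames
inner = pd.merge(df1, df2, on='ID', how='inner')

# left: every ID from df1, with Salary filled in where a match exists
left = pd.merge(df1, df2, on='ID', how='left')

# indicator=True adds a '_merge' column: left_only, right_only, or both
flagged = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

print(inner)
print(left)
print(flagged)
```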


Concatenating DataFrames:

In [15]:
df3 = pd.DataFrame({'Name': ['Eve', 'Frank'], 'Age': [29, 36]})
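# Stack df3 below df; ignore_index=True renumbers the rows from 0.
# df3 has no 'City' column, so those cells become NaN.
concatenated_df = pd.concat([df, df3], ignore_index=True)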
print("\nConcatenated DataFrame:")
print(concatenated_df)


Concatenated DataFrame:
      Name   Age      City
0    sudha  35.0  New York
1  vardhan  32.0     india
2   sasank  23.0   Chicago
3      sai  45.0        UK
4      Eve  29.0       NaN
5    Frank  36.0       NaN


4. Application in Data Science

Efficiency: Pandas is built on top of NumPy, so operations on whole columns run as vectorized array code rather than as Python loops.

Ease of Use: Pandas offers an intuitive and user-friendly API for data manipulation, making complex tasks simpler and faster.

Integration: Pandas integrates well with other Python libraries such as Matplotlib, Seaborn, and Scikit-learn, enhancing its utility in data science workflows.

Handling Large Datasets: Pandas handles datasets that fit in memory efficiently, enabling data scientists to perform filtering, merging, and aggregation tasks quickly.
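
To make the efficiency point concrete, here is a minimal sketch of column-wise (vectorized) operations. The data and the 'HalfPrice' column are hypothetical, and the example only shows the style of code, not a benchmark.

```python
import pandas as pd

# Hypothetical price data, for illustration only
prices = pd.DataFrame({'Price': [100.0, 200.0, 400.0]})

# One expression applies to every row; no explicit Python loop is needed
prices['HalfPrice'] = prices['Price'] * 0.5

# Comparisons are vectorized too; summing the boolean result counts the matches
n_expensive = (prices['Price'] > 150).sum()

print(prices)
print(n_expensive)
```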

Real-world examples:

Data Cleaning: Data scientists often use Pandas to clean datasets by handling missing values, removing duplicates, and converting data types. For example, missing values can be filled with fillna() or dropped with dropna().

Exploratory Data Analysis (EDA): Pandas is used to summarize data, visualize trends, and explore relationships between variables. Functions like describe(), groupby(), and plot() are commonly used for EDA.

Time Series Analysis: In finance and economics, Pandas is used for time series data analysis, allowing data scientists to resample, shift, and manipulate time series data effectively.
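
As a small sketch of the time series operations mentioned above, the code below builds a date-indexed Series, resamples it into two-day averages, and uses shift() to compute the day-over-day change. The dates and prices are invented for illustration.

```python
import pandas as pd

# Hypothetical daily prices indexed by date
dates = pd.date_range('2024-01-01', periods=6, freq='D')
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 110.0], index=dates)

# resample() groups the rows into fixed time bins (here, 2 days) before aggregating
two_day_mean = prices.resample('2D').mean()

# shift(1) moves every value down one row, so this is today minus yesterday
daily_change = prices - prices.shift(1)

print(two_day_mean)
print(daily_change)
```

The first entry of daily_change is NaN because the first day has no previous value to compare against.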

Conclusion

By working through the examples above, you will gain a solid understanding of how Pandas can be used for data handling and analysis in data science. Pandas is a powerful tool that simplifies data preparation and analysis tasks, making it indispensable for data science professionals dealing with real-world data.