# Pandas: Data Analysis Made Easy

[Pandas](https://pandas.pydata.org/) is an open-source data manipulation and analysis library for Python. It provides powerful tools for working with structured data, making data analysis tasks more efficient and intuitive.

## Why Pandas?

- **Flexible Data Structures:** Pandas offers two main data structures, Series (1-dimensional) and DataFrame (2-dimensional), which can handle both labeled and unlabeled data.

- **Data Cleaning and Preparation:** Pandas simplifies the process of cleaning and preparing data by providing functions to handle missing data, duplicate entries, data type conversions, and more.

- **Data Exploration and Analysis:** With Pandas, you can easily explore and analyze your data using functions for filtering, sorting, grouping, aggregating, and visualizing data.

- **Integration with Other Libraries:** Pandas seamlessly integrates with other Python libraries like NumPy, Matplotlib, and scikit-learn, making it a powerful tool for data analysis and machine learning workflows.

- **Rich Functionality:** Pandas offers a wide range of functions and methods for data manipulation, including merging and joining datasets, reshaping data, time series analysis, and handling large datasets efficiently.

- **Community Support:** Pandas has a large and active community of users and developers who contribute to its development, provide support, and share resources and best practices.

## Getting Started with Pandas

To get started with Pandas, you can install it using pip:

```bash
pip install pandas
```

In [44]:
! pip install pandas



Once installed, you can import Pandas in your Python scripts or Jupyter notebooks and start working with your data.

```python
import pandas as pd
```

In [45]:
import pandas as pd

## Read

### Read from a CSV file

In [46]:
# Read the CSV file and use its first column (id) as the index
reviews = pd.read_csv("data/shootings.csv", index_col=0)
# Show the first 3 rows
reviews.head(3)

id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,arms_category
3,Tim Elliot,2015-01-02,shot,gun,53.0,M,Asian,Shelton,WA,True,attack,Not fleeing,False,Guns
4,Lewis Lee Lembke,2015-01-02,shot,gun,47.0,M,White,Aloha,OR,False,attack,Not fleeing,False,Guns
5,John Paul Quintero,2015-01-03,shot and Tasered,unarmed,23.0,M,Hispanic,Wichita,KS,False,other,Not fleeing,False,Unarmed


In [47]:
# last 5 rows
reviews.tail()

id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,arms_category
5916,Rayshard Brooks,2020-06-12,shot,Taser,27.0,M,Black,Atlanta,GA,False,attack,Foot,True,Electrical devices
5925,Caine Van Pelt,2020-06-12,shot,gun,23.0,M,Black,Crown Point,IN,False,attack,Car,False,Guns
5918,Hannah Fizer,2020-06-13,shot,unarmed,25.0,F,White,Sedalia,MO,False,other,Not fleeing,False,Unarmed
5921,William Slyter,2020-06-13,shot,gun,22.0,M,White,Kansas City,MO,False,other,Other,False,Guns
5924,Nicholas Hirsh,2020-06-15,shot,gun,31.0,M,White,Lawrence,KS,False,attack,Car,False,Guns


In [48]:
# All columns of the dataframe
reviews.columns

Index(['name', 'date', 'manner_of_death', 'armed', 'age', 'gender', 'race',
       'city', 'state', 'signs_of_mental_illness', 'threat_level', 'flee',
       'body_camera', 'arms_category'],
      dtype='object')

In [49]:
# Confirm that the CSV file was read into a DataFrame
type(reviews)

pandas.core.frame.DataFrame

In [50]:
# Skip the 1st and 10th data rows (skiprows counts file lines from 0, and line 0 is the header)
skip_rows = pd.read_csv("data/shootings.csv", skiprows=[1, 10])
skip_rows.head(3)

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,arms_category
0,4,Lewis Lee Lembke,2015-01-02,shot,gun,47.0,M,White,Aloha,OR,False,attack,Not fleeing,False,Guns
1,5,John Paul Quintero,2015-01-03,shot and Tasered,unarmed,23.0,M,Hispanic,Wichita,KS,False,other,Not fleeing,False,Unarmed
2,8,Matthew Hoffman,2015-01-04,shot,toy weapon,32.0,M,White,San Francisco,CA,True,attack,Not fleeing,False,Other unusual objects
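
The row-skipping behaviour can be tried without the dataset. The sketch below is only an illustration: `csv_text` and its contents are made up, and `io.StringIO` stands in for a file on disk. It also shows two other `read_csv` parameters, `usecols` and `nrows`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """id,name,age
1,Ann,30
2,Ben,41
3,Cal,25
4,Dee,37
"""

# skiprows counts file lines from 0: line 0 is the header, line 1 is the first data row
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 3])

# usecols keeps only the named columns; nrows limits how many data rows are read
subset = pd.read_csv(io.StringIO(csv_text), usecols=["id", "age"], nrows=2)

print(skipped)
print(subset)
```

Because the header is line 0, `skiprows=[1, 3]` removes the first and third data rows and keeps the header intact.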


In [51]:
# Number of null values in each column
null_count_per_column = reviews.isnull().sum()
null_count_per_column

name                       0
date                       0
manner_of_death            0
armed                      0
age                        0
gender                     0
race                       0
city                       0
state                      0
signs_of_mental_illness    0
threat_level               0
flee                       0
body_camera                0
arms_category              0
dtype: int64

**describe()**

The `describe()` method generates descriptive statistics for a DataFrame. By default it summarizes each numeric column with its count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles, which gives a quick overview of the distribution and central tendency of the data.

In [52]:
# only numeric columns are considered
reviews.describe()

Unnamed: 0,age
count,4895.0
mean,36.54975
std,12.694348
min,6.0
25%,27.0
50%,35.0
75%,45.0
max,91.0


In [53]:
# show the data type of each column
reviews.dtypes

name                        object
date                        object
manner_of_death             object
armed                       object
age                        float64
gender                      object
race                        object
city                        object
state                       object
signs_of_mental_illness       bool
threat_level                object
flee                        object
body_camera                   bool
arms_category               object
dtype: object
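
The `date` column above is stored as `object` (plain strings) and `age` as `float64`. A minimal sketch of converting such columns, assuming a small hand-made frame (`sample` is a hypothetical stand-in for `reviews`):

```python
import pandas as pd

# Hypothetical two-row stand-in for the real dataset
sample = pd.DataFrame({
    "date": ["2015-01-02", "2015-01-03"],
    "age": [53.0, 23.0],
})

# Parse the date strings into datetime values
sample["date"] = pd.to_datetime(sample["date"])

# Convert whole-number floats to integers (this fails if the column contains NaN)
sample["age"] = sample["age"].astype("int64")

print(sample.dtypes)
```

Once the column holds datetimes, accessors such as `sample["date"].dt.year` become available.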

**Series in DataFrame**

In Pandas, a Series is a one-dimensional labeled array that can hold data of any type (e.g., integers, strings, floats). Every value is paired with an index label.

**Usage**

A Series is commonly used to represent a single column or row of data in a DataFrame. It can be created from various data structures like lists, dictionaries, or NumPy arrays.
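
As a small sketch of the three construction routes just mentioned (the variable names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# From a list: labels default to 0, 1, 2
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"a": 1.5, "b": 2.5})

# From a NumPy array, with explicit labels and a name
from_array = pd.Series(np.arange(3), index=["x", "y", "z"], name="counts")

print(from_list)
print(from_dict)
print(from_array)
```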

In [54]:
# Series
type(reviews['age'])

pandas.core.series.Series

In [55]:
# Double square brackets (a list of column names) return a DataFrame, even for one column
type(reviews[['age']])

pandas.core.frame.DataFrame

In [56]:
# Select several columns by name
reviews[['age', 'gender']]

id,age,gender
3,53.0,M
4,47.0,M
5,23.0,M
8,32.0,M
9,39.0,M
...,...,...
5916,27.0,M
5925,23.0,M
5918,25.0,F
5921,22.0,M


### Read from an Excel file

Reading `.xlsx` files requires the `openpyxl` package, which Pandas uses as its Excel engine:

```bash
pip install openpyxl
```

In [57]:
! pip install openpyxl



In [58]:
df_excel = pd.read_excel("data/Cola.xlsx", names = ['Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6', 'Column7', 'Column8', 'Column9', 'Column10', 'Column11'])
df_excel.head()

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11
,,,,,,,,,,,
Profit & Loss statement,,,,,,,,,,,
,in million USD,FY '09,FY '10,FY '11,FY '12,FY '13,FY '14,FY '15,FY '16,FY '17,FY '18
,NET OPERATING REVENUES,30990,35119,46542,48017,46854,45998,44294,41863,35410,31856
,Cost of goods sold,11088,12693,18215,19053,18421,17889,17482,16465,13255,11770


### Read from HTML

`pd.read_html()` returns a list of DataFrames, one for each table found on the page, so you pick the table you need by its position in the list. Parsing the HTML requires a parser library such as `lxml`:

```bash
pip install lxml
```

In [59]:
! pip install lxml



In [60]:
df_html = pd.read_html('https://www.basketball-reference.com/teams/TOR/2024.html')
df_html

[       No.                      Player Pos    Ht   Wt         Birth Date  \
 0        4              Scottie Barnes  SG   6-7  227     August 1, 2001   
 1       33              Gary Trent Jr.  SG   6-5  209   January 18, 1999   
 2       19                Jakob Poeltl   C   7-0  245   October 15, 1995   
 3       25               Chris Boucher  PF   6-9  200   January 11, 1993   
 4        1                 Gradey Dick  SG   6-6  205  November 20, 2003   
 5        2             Jalen McDaniels  SF   6-9  205   January 31, 1998   
 6        5           Immanuel Quickley  PG   6-3  190      June 17, 1999   
 7        9                  RJ Barrett  SG   6-6  214      June 14, 2000   
 8       11                 Bruce Brown  PG   6-4  202    August 15, 1996   
 9   11, 34          Jontay Porter (TW)  PF  6-11  240  November 15, 1999   
 10      14              Garrett Temple  SG   6-5  195        May 8, 1986   
 11      13                Jordan Nwora  SF   6-8  225  September 9, 1998   

In [61]:
# read_html returns a list 
type(df_html)

list

In [62]:
# Every element of the list is a DataFrame; here we check the third one
type(df_html[2])

pandas.core.frame.DataFrame

In [63]:
df = df_html[2]
df.head()

Unnamed: 0,Rk,Player,Age,G,GS,MP,FG,FGA,FG%,3P,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1.0,Scottie Barnes,22.0,58,58.0,2040,434,913,0.475,99,...,0.786,136,341,477,353,71,87,159,115,1162
1,2.0,Dennis Schröder,30.0,51,33.0,1559,247,559,0.442,77,...,0.852,22,118,140,313,46,8,83,105,698
2,3.0,Gary Trent Jr.,25.0,53,23.0,1409,230,548,0.42,127,...,0.7,14,107,121,82,49,8,36,69,622
3,4.0,Pascal Siakam,29.0,39,39.0,1354,325,623,0.522,46,...,0.758,54,192,246,190,32,10,83,87,865
4,5.0,Jakob Poeltl,28.0,47,47.0,1250,232,350,0.663,0,...,0.559,137,269,406,116,30,72,74,138,521


### Using Requests Library in Python


The `requests` library is a popular HTTP library for Python, used to make HTTP requests and handle responses easily.

**Installation**

You can install the `requests` library using pip:

```bash
pip install requests
```

In [64]:
! pip install requests



In [65]:
# JSON data using URL
import requests
res = requests.get('https://api.github.com/repos/pandas-dev/pandas/issues')
data = res.json()
len(data)

30

In [66]:
# Print the API URL of each issue
for issue in data:
    print(issue['url'])

https://api.github.com/repos/pandas-dev/pandas/issues/57672
https://api.github.com/repos/pandas-dev/pandas/issues/57671
https://api.github.com/repos/pandas-dev/pandas/issues/57670
https://api.github.com/repos/pandas-dev/pandas/issues/57668
https://api.github.com/repos/pandas-dev/pandas/issues/57666
https://api.github.com/repos/pandas-dev/pandas/issues/57665
https://api.github.com/repos/pandas-dev/pandas/issues/57664
https://api.github.com/repos/pandas-dev/pandas/issues/57663
https://api.github.com/repos/pandas-dev/pandas/issues/57662
https://api.github.com/repos/pandas-dev/pandas/issues/57661
https://api.github.com/repos/pandas-dev/pandas/issues/57660
https://api.github.com/repos/pandas-dev/pandas/issues/57659
https://api.github.com/repos/pandas-dev/pandas/issues/57657
https://api.github.com/repos/pandas-dev/pandas/issues/57656
https://api.github.com/repos/pandas-dev/pandas/issues/57651
https://api.github.com/repos/pandas-dev/pandas/issues/57648
https://api.github.com/repos/pandas-dev/

In [67]:
data_df = pd.DataFrame(data)
data_df.head()

Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,closed_at,author_association,active_lock_reason,body,reactions,timeline_url,performed_via_github_app,state_reason,draft,pull_request
0,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/issues/57672,2160456617,I_kwDOAA0YD86AxfOp,57672,BUG: List of years (as string) raises UserWarn...,...,,CONTRIBUTOR,,### Pandas version checks\n\n- [X] I have chec...,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas...,,,,
1,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/57671,2160328911,PR_kwDOAA0YD85oPp4V,57671,CLN: Enforce deprecation of pinning name in Se...,...,,MEMBER,,- [ ] closes #xxxx (Replace xxxx with the GitH...,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas...,,,False,{'url': 'https://api.github.com/repos/pandas-d...
2,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/57670,2160265182,PR_kwDOAA0YD85oPciS,57670,DOC: Update drop duplicates documentation to s...,...,,NONE,,- [x] closes #56784 \r\n- [x] All [code checks...,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas...,,,False,{'url': 'https://api.github.com/repos/pandas-d...
3,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/57668,2160053175,PR_kwDOAA0YD85oOv9x,57668,CLN: More numpy 2 stuff,...,,MEMBER,,- [ ] closes #xxxx (Replace xxxx with the GitH...,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas...,,,False,{'url': 'https://api.github.com/repos/pandas-d...
4,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/issues/57666,2159798305,I_kwDOAA0YD86Au-gh,57666,BUG: pyarrow stripping leading zeros with dtyp...,...,,NONE,,### Pandas version checks\n\n- [X] I have chec...,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas...,,,,


In [68]:
data_df['user'][:3]

0    {'login': 'jrmylow', 'id': 33999325, 'node_id'...
1    {'login': 'rhshadrach', 'id': 45562402, 'node_...
2    {'login': 'flaviaouyang', 'id': 74445682, 'nod...
Name: user, dtype: object

**Using `pd.DataFrame.from_records()` in Pandas**

The `pd.DataFrame.from_records()` method creates a DataFrame from a structured array or from a sequence of records such as tuples or dictionaries. It is handy when a column holds one dictionary per row: each dictionary becomes a row and its keys become the columns.

**Usage**

Suppose you have a DataFrame `data_df` with a column named `'user'`, and you want to create a new DataFrame from the values in this column. You can use `pd.DataFrame.from_records()` as follows:



In [69]:
user_df = pd.DataFrame.from_records(data_df['user'])
user_df.head(2)

Unnamed: 0,login,id,node_id,avatar_url,gravatar_id,url,html_url,followers_url,following_url,gists_url,starred_url,subscriptions_url,organizations_url,repos_url,events_url,received_events_url,type,site_admin
0,jrmylow,33999325,MDQ6VXNlcjMzOTk5MzI1,https://avatars.githubusercontent.com/u/339993...,,https://api.github.com/users/jrmylow,https://github.com/jrmylow,https://api.github.com/users/jrmylow/followers,https://api.github.com/users/jrmylow/following...,https://api.github.com/users/jrmylow/gists{/gi...,https://api.github.com/users/jrmylow/starred{/...,https://api.github.com/users/jrmylow/subscript...,https://api.github.com/users/jrmylow/orgs,https://api.github.com/users/jrmylow/repos,https://api.github.com/users/jrmylow/events{/p...,https://api.github.com/users/jrmylow/received_...,User,False
1,rhshadrach,45562402,MDQ6VXNlcjQ1NTYyNDAy,https://avatars.githubusercontent.com/u/455624...,,https://api.github.com/users/rhshadrach,https://github.com/rhshadrach,https://api.github.com/users/rhshadrach/followers,https://api.github.com/users/rhshadrach/follow...,https://api.github.com/users/rhshadrach/gists{...,https://api.github.com/users/rhshadrach/starre...,https://api.github.com/users/rhshadrach/subscr...,https://api.github.com/users/rhshadrach/orgs,https://api.github.com/users/rhshadrach/repos,https://api.github.com/users/rhshadrach/events...,https://api.github.com/users/rhshadrach/receiv...,User,False
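
An alternative worth knowing is `pd.json_normalize()`, which flattens nested dictionaries in one step. The sketch below uses a made-up list of records (`issues` is hypothetical and much smaller than the real GitHub response) rather than a live API call:

```python
import pandas as pd

# Hypothetical records shaped like a simplified GitHub issues response
issues = [
    {"number": 1, "title": "First", "user": {"login": "alice", "id": 101}},
    {"number": 2, "title": "Second", "user": {"login": "bob", "id": 102}},
]

# Nested keys are flattened into dotted column names such as "user.login"
flat = pd.json_normalize(issues)

print(flat.columns.tolist())
print(flat["user.login"].tolist())
```

Compared with `from_records()` on a single column, this keeps the top-level fields and the nested fields together in one DataFrame.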


## Save

### Save as a CSV file

A CSV (comma-separated values) file is a plain-text format for storing tabular data, with one row per line.

In [70]:

reviews.to_csv("reviews.csv")  # write every column, including the index and the header row

reviews.to_csv("reviews_1.csv", index=False)  # omit the index column

reviews.to_csv("reviews_2.csv", index=False, header=False)  # omit the index and the header row

reviews.to_csv("reviews_3.csv", index=False, header=False, columns=None)  # columns=None (the default) still writes every column

reviews.to_csv("reviews_4.csv", index=False, header=False, columns=None, sep=",")  # sep sets the field delimiter; "," is the default

reviews.to_csv("reviews_5.csv", index=False, header=False, columns=None, sep=",", encoding="utf-8")  # encoding sets the text encoding of the output file

reviews.to_csv("reviews_6.csv", index=False, sep="#", columns=['gender', 'age'])  # write only the gender and age columns, separated by "#"
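
To see what the `index` parameter changes without writing files, `to_csv()` can be called without a path, in which case it returns the CSV text as a string. The frame below is a hypothetical two-row stand-in for `reviews`:

```python
import io

import pandas as pd

# Hypothetical stand-in for the reviews DataFrame
frame = pd.DataFrame({"gender": ["M", "F"], "age": [53, 25]}, index=[3, 5918])
frame.index.name = "id"

# With no path argument, to_csv returns the CSV content as a string
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

print(with_index)
print(without_index)

# Read the text back, restoring the id column as the index
restored = pd.read_csv(io.StringIO(with_index), index_col="id")
print(restored)
```

Saving with `index=False` drops the `id` labels for good, so keep the index whenever it carries information you will need after reloading.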

### Saving Data to a Pickle File 


In Pandas, you can save a DataFrame or Series to a Pickle file using the `to_pickle()` method. Pickle is a binary serialization format in Python that allows you to store data objects in a compact binary format.

**Usage**

To save a DataFrame or Series to a Pickle file, use the `to_pickle()` method:

In [71]:
reviews.to_pickle("my_pickle")
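
A pickle file is read back with `pd.read_pickle()`. The round trip below is a sketch that writes to a temporary directory (the frame and the file name `frame.pkl` are assumptions for illustration). Note that loading a pickle can execute arbitrary code, so only read pickle files from sources you trust.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the reviews DataFrame
frame = pd.DataFrame({"age": [53, 47], "armed": ["gun", "gun"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "frame.pkl")
    frame.to_pickle(path)            # serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back, dtypes included

print(restored.equals(frame))
```

Unlike CSV, a pickle preserves the index and column dtypes, so no parsing options are needed when reloading.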

### Save as an Excel file

Excel files are digital spreadsheets that organize data in rows and columns, offering a versatile tool for tasks like calculations, data analysis, and information management.

In [72]:
df.to_excel('player_details.xlsx')

## Synthetic Data Generation for Pandas


Synthetic data generation is the process of creating artificial data that resembles real-world data but is generated algorithmically. This can be useful for various purposes, including testing machine learning models, data augmentation, and privacy-preserving data sharing.

**Usage**

Synthetic values are usually generated with NumPy and then wrapped in a Pandas DataFrame or Series:


#### Random Data Generation

Pandas' `DataFrame` constructor can be used to create synthetic data with random values:


##### Explanation of `np.random.randn(10, 4), columns=['A', 'B', 'C', 'D']`

- `np.random.randn(10, 4)`: This code generates a 2-dimensional array of random numbers with a normal distribution (mean=0, variance=1) of shape (10, 4), meaning it will create 10 rows and 4 columns of random numbers.

- `columns=['A', 'B', 'C', 'D']`: This code specifies the column names for the DataFrame that will be created using the random numbers generated above. The DataFrame will have columns named 'A', 'B', 'C', and 'D'.

Together, this code generates a DataFrame with 10 rows and 4 columns, where the values in each column are random numbers drawn from a normal distribution.



In [73]:
import pandas as pd
import numpy as np

# Create a DataFrame with random values
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df.head()

Unnamed: 0,A,B,C,D
0,-0.801544,0.48165,-0.128315,1.171599
1,0.173703,0.126643,0.821466,0.070699
2,-1.760444,0.06897,0.722967,-1.324226
3,1.079414,-1.681141,1.287072,0.22678
4,-0.683783,-0.32079,-1.336605,-0.243152
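
The values above change on every run. If reproducible synthetic data is needed, one option is NumPy's `default_rng` with a fixed seed; the seed value 42 and the variable names below are arbitrary choices for this sketch:

```python
import numpy as np
import pandas as pd

columns = ["A", "B", "C", "D"]

# A seeded generator produces the same draws on every run
rng = np.random.default_rng(seed=42)
random_df = pd.DataFrame(rng.standard_normal((10, 4)), columns=columns)

# Re-creating the generator with the same seed reproduces the same table
repeat_rng = np.random.default_rng(seed=42)
repeat_df = pd.DataFrame(repeat_rng.standard_normal((10, 4)), columns=columns)

print(random_df.head())
print(random_df.equals(repeat_df))
```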


#### JSON Data

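A Python dictionary, such as one parsed from a JSON string, can also be converted to a DataFrame. Because `interests` is a list of three items, Pandas creates three rows and repeats the scalar values in each of them.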


In [74]:
data = {
  "name": "John Doe",
  "age": 30,
  "city": "New York",
  "interests": ["hiking", "reading", "cooking"]
    }
type(data)

dict

In [75]:
data_df = pd.DataFrame(data)
data_df

Unnamed: 0,name,age,city,interests
0,John Doe,30,New York,hiking
1,John Doe,30,New York,reading
2,John Doe,30,New York,cooking


In [76]:
data_df['interests']

0     hiking
1    reading
2    cooking
Name: interests, dtype: object
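
If the dictionary is meant to be a single record, wrapping it in a list keeps it as one row with the whole list stored in one cell. This is a sketch using the same example values; `record`, `one_row`, and `exploded` are names chosen here for illustration:

```python
import pandas as pd

record = {
    "name": "John Doe",
    "age": 30,
    "city": "New York",
    "interests": ["hiking", "reading", "cooking"],
}

# A list of dictionaries is read as one row per dictionary
one_row = pd.DataFrame([record])
print(one_row)

# explode() turns each list item into its own row when that layout is wanted
exploded = one_row.explode("interests")
print(exploded)
```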

## Concepts

### Setting up data

In [77]:
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'Age': [25, 30, 35, 40, 45],
        'Salary': [50000, 60000, 70000, 80000, 90000],
        'Department': ['HR', 'IT', 'Finance', 'IT', 'HR']}

data2 = {'Name': ['Alice', 'Bob', 'Charlie'],
         'Hire_Date': ['2020-01-15', '2019-05-20', '2021-02-10']}

df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

In [78]:
df.head(2)

Unnamed: 0,Name,Age,Salary,Department
0,Alice,25,50000,HR
1,Bob,30,60000,IT


In [79]:
df2.head(2)

Unnamed: 0,Name,Hire_Date
0,Alice,2020-01-15
1,Bob,2019-05-20


### iloc

**Usage**

The `iloc` indexer allows for integer-based indexing to select rows and columns from a DataFrame.

**Syntax**

```python
dataframe.iloc[row_indexer, column_indexer]
```


In [80]:
# Select rows at positions 1 to 3 and columns at positions 0 and 1 (the stop position is excluded)
df.iloc[1:4, 0:2]

Unnamed: 0,Name,Age
1,Bob,30
2,Charlie,35
3,David,40


In [81]:
# Select rows at positions 1 and 3 and columns at positions 0 and 2
df.iloc[[1,3], [0,2]]

Unnamed: 0,Name,Salary
1,Bob,60000
3,David,80000


### loc

**Usage**

`loc` is a label-based indexer used to access rows and columns of a DataFrame by label(s) or with a boolean array.

**Syntax**

```python
DataFrame.loc[row_indexer, column_indexer]
```

In [82]:
# Boolean mask: True where Age is greater than 30
mask = df['Age'] > 30
mask

0    False
1    False
2     True
3     True
4     True
Name: Age, dtype: bool

In [83]:
# loc also accepts a boolean Series: only the rows where the mask is True are returned
df.loc[mask]



Unnamed: 0,Name,Age,Salary,Department
2,Charlie,35,70000,Finance
3,David,40,80000,IT
4,Emily,45,90000,HR


In [84]:
# With the default integer index the labels are 0, 1, 2, ..., so loc[2] and iloc[2] return the same row
df.loc[2]

Name          Charlie
Age                35
Salary          70000
Department    Finance
Name: 2, dtype: object

In [85]:
df.iloc[2]

Name          Charlie
Age                35
Salary          70000
Department    Finance
Name: 2, dtype: object

### Difference between iloc and loc

In [86]:
# Parse a JSON string into a dictionary, then build a DataFrame from it
import json

# Synthetic data in JSON format
json_data = '''
{
  "data": [
    {"name": "John", "age": 30, "city": "New York"},
    {"name": "Alice", "age": 25, "city": "Los Angeles"},
    {"name": "Bob", "age": 35, "city": "Chicago"},
    {"name": "Emily", "age": 28, "city": "San Francisco"}
  ]
}
'''

# Load JSON data into a Python dictionary
data_dict = json.loads(json_data)

# Create DataFrame from dictionary with specified index
df_specified = pd.DataFrame(data_dict['data']).set_index('name')

df_specified.head(2)

name,age,city
John,30,New York
Alice,25,Los Angeles


In [87]:
# df_specified.iloc['Bob']  # TypeError: iloc accepts integer positions only

df_specified.iloc[2]

age          35
city    Chicago
Name: Bob, dtype: object

In [88]:
# df_specified.loc[2]  # KeyError: 2 is not a label in this index

df_specified.loc['Bob']

age          35
city    Chicago
Name: Bob, dtype: object
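
Slicing shows another difference between the two indexers: `iloc` excludes the stop position, while `loc` includes the stop label. A small sketch with a name-indexed frame (`people` is a hypothetical stand-in for `df_specified`):

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [30, 25, 35, 28]},
    index=["John", "Alice", "Bob", "Emily"],
)

# Position-based slice: positions 0 and 1, the stop position 2 is excluded
by_position = people.iloc[0:2]

# Label-based slice: every row from "John" through "Bob", the stop label is included
by_label = people.loc["John":"Bob"]

print(by_position)
print(by_label)
```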

### Group by


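The general pattern is `df_grouped = df.groupby('column_name').agg({'column_to_aggregate': 'function'})`: group the rows by one column, then aggregate another column within each group.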


In [89]:
# Grouping data by Department and calculating mean salary
grouped = df.groupby('Department')['Salary'].mean()
grouped

Department
Finance    70000.0
HR         70000.0
IT         70000.0
Name: Salary, dtype: float64

In [90]:
type(grouped)

pandas.core.series.Series

In [91]:
grouped_2 = df.groupby('Department').agg({'Salary': 'mean'})
grouped_2

Department,Salary
Finance,70000.0
HR,70000.0
IT,70000.0


In [92]:
type(grouped_2)

pandas.core.frame.DataFrame
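
Several statistics can be computed in one pass with named aggregation, where each keyword becomes an output column. The sketch below rebuilds the relevant columns of the sample data; `staff`, `summary`, and the output column names are illustrative choices:

```python
import pandas as pd

staff = pd.DataFrame({
    "Department": ["HR", "IT", "Finance", "IT", "HR"],
    "Salary": [50000, 60000, 70000, 80000, 90000],
})

# Each keyword is new_column=(source_column, aggregation_function)
summary = staff.groupby("Department").agg(
    mean_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Salary", "count"),
)

print(summary)
```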

### Merge

In [93]:
# inner merge
merged_df = pd.merge(df, df2, on='Name')
merged_df


Unnamed: 0,Name,Age,Salary,Department,Hire_Date
0,Alice,25,50000,HR,2020-01-15
1,Bob,30,60000,IT,2019-05-20
2,Charlie,35,70000,Finance,2021-02-10


In [94]:
# left merge
pd.merge(df, df2, on='Name', how='left')

Unnamed: 0,Name,Age,Salary,Department,Hire_Date
0,Alice,25,50000,HR,2020-01-15
1,Bob,30,60000,IT,2019-05-20
2,Charlie,35,70000,Finance,2021-02-10
3,David,40,80000,IT,
4,Emily,45,90000,HR,


In [95]:
# Right merge
pd.merge(df, df2, on='Name', how='right')

Unnamed: 0,Name,Age,Salary,Department,Hire_Date
0,Alice,25,50000,HR,2020-01-15
1,Bob,30,60000,IT,2019-05-20
2,Charlie,35,70000,Finance,2021-02-10


In [96]:
# outer merge
pd.merge(df, df2, on='Name', how='outer')

Unnamed: 0,Name,Age,Salary,Department,Hire_Date
0,Alice,25,50000,HR,2020-01-15
1,Bob,30,60000,IT,2019-05-20
2,Charlie,35,70000,Finance,2021-02-10
3,David,40,80000,IT,
4,Emily,45,90000,HR,
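
When comparing join types it helps to see where each row came from. `pd.merge()` accepts `indicator=True`, which adds a `_merge` column. The two frames below are hypothetical and deliberately include a name that exists only on the right side:

```python
import pandas as pd

left = pd.DataFrame({"Name": ["Alice", "Bob", "David"], "Age": [25, 30, 40]})
right = pd.DataFrame({
    "Name": ["Alice", "Bob", "Zoe"],
    "Hire_Date": ["2020-01-15", "2019-05-20", "2022-03-01"],
})

# _merge is "both", "left_only", or "right_only" for each row
merged = pd.merge(left, right, on="Name", how="outer", indicator=True)

print(merged)
```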


In [97]:
# Concatenate: stack df2 below df; columns missing from one frame are filled with NaN
pd.concat([df, df2])


Unnamed: 0,Name,Age,Salary,Department,Hire_Date
0,Alice,25.0,50000.0,HR,
1,Bob,30.0,60000.0,IT,
2,Charlie,35.0,70000.0,Finance,
3,David,40.0,80000.0,IT,
4,Emily,45.0,90000.0,HR,
0,Alice,,,,2020-01-15
1,Bob,,,,2019-05-20
2,Charlie,,,,2021-02-10
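
Notice that the index labels 0, 1, and 2 appear twice in the result above, because `concat` keeps each frame's original labels. A minimal sketch of renumbering the rows with `ignore_index=True`, using two small hypothetical frames that share the same columns:

```python
import pandas as pd

first = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
second = pd.DataFrame({"Name": ["Charlie"], "Age": [35]})

# ignore_index=True discards the original labels and numbers the rows 0..n-1
stacked = pd.concat([first, second], ignore_index=True)

print(stacked)
```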


### drop

In [98]:
# Drop a column; inplace=True modifies df directly instead of returning a copy
df.drop('Department', axis=1, inplace=True)



In [99]:
df

Unnamed: 0,Name,Age,Salary
0,Alice,25,50000
1,Bob,30,60000
2,Charlie,35,70000
3,David,40,80000
4,Emily,45,90000


In [100]:
# Drop the row labeled 1; inplace=False (the default) returns a new DataFrame
df_dropped = df.drop(1, axis=0, inplace=False)
df_dropped

Unnamed: 0,Name,Age,Salary
0,Alice,25,50000
2,Charlie,35,70000
3,David,40,80000
4,Emily,45,90000


In [101]:
df  # the row labeled 1 is still present because inplace=False left df unchanged

Unnamed: 0,Name,Age,Salary
0,Alice,25,50000
1,Bob,30,60000
2,Charlie,35,70000
3,David,40,80000
4,Emily,45,90000


In [102]:
# Describe the DataFrame; include='all' covers both numeric and categorical columns
df.describe(include='all')

Unnamed: 0,Name,Age,Salary
count,5,5.0,5.0
unique,5,,
top,Alice,,
freq,1,,
mean,,35.0,70000.0
std,,7.905694,15811.388301
min,,25.0,50000.0
25%,,30.0,60000.0
50%,,35.0,70000.0
75%,,40.0,80000.0


## Logic implementation

In [103]:

# Creating synthetic data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 45],
    'Gender': ['Female', 'Male', np.nan, 'Male', 'Female'],
    'Salary': [50000, 60000, 70000, 80000, 90000]
}
df = pd.DataFrame(data)

# Selecting rows where Age is greater than 30
filter_1 = df[df['Age'] > 30]
filter_1


Unnamed: 0,Name,Age,Gender,Salary
2,Charlie,35,,70000
3,David,40,Male,80000
4,Emma,45,Female,90000


In [104]:
# Selecting rows where Age is greater than 30 and Salary is less than 80000
filter_2 = df[(df['Age'] > 30) & (df['Salary'] < 80000)]
filter_2

Unnamed: 0,Name,Age,Gender,Salary
2,Charlie,35,,70000


In [105]:
# Selecting rows where Age is greater than 30 or Gender is 'Female'
filter_3 = df[(df['Age'] > 30) | (df['Gender'] == 'Female')]
filter_3

Unnamed: 0,Name,Age,Gender,Salary
0,Alice,25,Female,50000
2,Charlie,35,,70000
3,David,40,Male,80000
4,Emma,45,Female,90000


In [106]:
df[df['Name'].isin(['David', 'Charlie'])]

Unnamed: 0,Name,Age,Gender,Salary
2,Charlie,35,,70000
3,David,40,Male,80000


In [107]:
df[df['Name'].str.contains('l')]

Unnamed: 0,Name,Age,Gender,Salary
0,Alice,25,Female,50000
2,Charlie,35,,70000


In [108]:
# Keep rows that contain NaN in at least one column (axis=1 checks across the columns of each row)
df[df.isna().any(axis=1)]

Unnamed: 0,Name,Age,Gender,Salary
2,Charlie,35,,70000
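
Once the incomplete rows are found, the usual next step is to drop or fill them. The sketch below uses a hypothetical three-row frame (`people`), and the replacement value "Unknown" is an arbitrary choice:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Gender": ["Female", "Male", np.nan],
})

# dropna() returns a copy without the rows that contain NaN
complete_rows = people.dropna()

# fillna() with a dictionary replaces NaN column by column
filled = people.fillna({"Gender": "Unknown"})

print(complete_rows)
print(filled)
```

Both methods return new DataFrames, so `people` itself still contains the missing value.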


In [109]:
# Filter rows where 'Age' satisfies a custom function (here: greater than the column mean)
age_mean = df['Age'].mean()  # compute the mean once instead of once per row
df[df['Age'].apply(lambda x: x > age_mean)]

Unnamed: 0,Name,Age,Gender,Salary
3,David,40,Male,80000
4,Emma,45,Female,90000


In [110]:
# Filter rows where Age is greater than 20 and Salary is greater than 75000 using the query method
df.query('Age > 20 & Salary > 75000')

Unnamed: 0,Name,Age,Gender,Salary
3,David,40,Male,80000
4,Emma,45,Female,90000
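
Thresholds do not have to be hard-coded in the query string: a local Python variable can be referenced by prefixing its name with `@`. A short sketch, where `staff` and `min_salary` are names introduced for this example:

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, 30, 35, 40, 45],
    "Salary": [50000, 60000, 70000, 80000, 90000],
})

min_salary = 75000

# @min_salary refers to the local Python variable of that name
high_paid = staff.query("Age > 20 and Salary > @min_salary")

print(high_paid)
```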



In [None]:
# try to split and understand this code
df['Name'][df[df['Age']==45].index] 
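
One way to split the expression above into steps is sketched below. It rebuilds only the two columns involved (`people` is a hypothetical stand-in for `df`) and ends with a shorter `loc` form that gives the same result:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, 30, 35, 40, 45],
})

# Step 1: boolean mask, True where Age equals 45
mask = people["Age"] == 45

# Step 2: keep only the rows where the mask is True
matching_rows = people[mask]

# Step 3: take the index labels of those rows
matching_labels = matching_rows.index

# Step 4: look up the Name values at those labels
names_chained = people["Name"][matching_labels]

# The same result in a single label-based selection
names_direct = people.loc[mask, "Name"]

print(names_chained)
print(names_direct)
```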