## Creating DataFrames
Let’s look at different ways to create a Pandas DataFrame, the core data structure you’ll rely on for most day-to-day data science work.

### Using Python List

In [6]:
import pandas as pd
data = [
    ["Himani", 23],
    ["Shubham", 25],
    ["Vaishanavi", 19]
]
df = pd.DataFrame(data, columns=["Name", "Age"])
print(df)

         Name  Age
0      Himani   23
1     Shubham   25
2  Vaishanavi   19
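
If the `columns` argument is left out, pandas falls back to integer column labels. The sketch below reuses the same list to show that default, then assigns the labels afterwards. The names `df_default` and `df_named` are only for illustration.

```python
import pandas as pd

data = [
    ["Himani", 23],
    ["Shubham", 25],
    ["Vaishanavi", 19],
]

# Without `columns`, pandas labels the columns 0, 1, ...
df_default = pd.DataFrame(data)
print(list(df_default.columns))  # [0, 1]

# Labels can also be assigned after construction
df_named = df_default.copy()
df_named.columns = ["Name", "Age"]
print(df_named)
```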


### Using Python Dictionary

In [8]:
data = {
    "Name": ["Himani", "Shubham", "Sakshi", "Vaishanavi", "Vaidehi"],
    "Age": [23, 28, 21, 26, 30],
    "City": ["Mumbai", "Pune", "Jaipur", "Chennai", "Delhi"]
}
df = pd.DataFrame(data)
print(df)

         Name  Age     City
0      Himani   23   Mumbai
1     Shubham   28     Pune
2      Sakshi   21   Jaipur
3  Vaishanavi   26  Chennai
4     Vaidehi   30    Delhi
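
When building a DataFrame from a dictionary, every list must have the same length, because each key becomes a column and each position becomes a row. The sketch below uses made-up data (`bad` and `good` are hypothetical names) to show the error pandas raises for uneven lists and one way to avoid it by marking the gap explicitly.

```python
import pandas as pd

# Hypothetical data: "Age" is one entry short
bad = {"Name": ["Himani", "Shubham", "Sakshi"], "Age": [23, 28]}

error = None
try:
    pd.DataFrame(bad)
except ValueError as exc:
    error = exc
    print("Could not build the DataFrame:", exc)

# One possible fix: mark the missing value explicitly with None
good = {"Name": ["Himani", "Shubham", "Sakshi"], "Age": [23, 28, None]}
df_good = pd.DataFrame(good)
print(df_good)
```

The missing entry shows up as `NaN` in the `Age` column, which the EDA checks for missing data can then pick up.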


### Using NumPy Array

In [12]:
import numpy as np
data = np.array([
    [23, 28],
    [78, 92],
    [56, 80],
    [82, 13]
])
df = pd.DataFrame(data, columns=["Age", "Roll.NO"])
print(df)

   Age  Roll.NO
0   23       28
1   78       92
2   56       80
3   82       13
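
A 2-D NumPy array maps onto a DataFrame row by row: the first axis becomes the rows and the second axis becomes the columns. As a small sketch of that relationship, the snippet below rebuilds the same array, checks its shape, and converts the DataFrame back with `to_numpy()`. The name `back` is only for illustration.

```python
import numpy as np
import pandas as pd

data = np.array([
    [23, 28],
    [78, 92],
    [56, 80],
    [82, 13],
])
print(data.shape)  # (4, 2): 4 rows, 2 columns

df = pd.DataFrame(data, columns=["Age", "Roll.NO"])

# The columns keep the array's dtype
print(df.dtypes)

# Converting back gives an array with the same values
back = df.to_numpy()
print(np.array_equal(back, data))  # True
```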


### From .csv File

In [13]:
pd.read_csv("annual-enterprise-survey-2024-financial-year-provisional.csv")

|  | Year | Industry_aggregation_NZSIOC | Industry_code_NZSIOC | Industry_name_NZSIOC | Units | Variable_code | Variable_name | Variable_category | Value | Industry_code_ANZSIC06 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2024 | Level 1 | 99999 | All industries | Dollars (millions) | H01 | Total income | Financial performance | 979594 | ANZSIC06 divisions A-S (excluding classes K633... |
| 1 | 2024 | Level 1 | 99999 | All industries | Dollars (millions) | H04 | Sales, government funding, grants and subsidies | Financial performance | 838626 | ANZSIC06 divisions A-S (excluding classes K633... |
| 2 | 2024 | Level 1 | 99999 | All industries | Dollars (millions) | H05 | Interest, dividends and donations | Financial performance | 112188 | ANZSIC06 divisions A-S (excluding classes K633... |
| 3 | 2024 | Level 1 | 99999 | All industries | Dollars (millions) | H07 | Non-operating income | Financial performance | 28781 | ANZSIC06 divisions A-S (excluding classes K633... |
| 4 | 2024 | Level 1 | 99999 | All industries | Dollars (millions) | H08 | Total expenditure | Financial performance | 856960 | ANZSIC06 divisions A-S (excluding classes K633... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55615 | 2013 | Level 3 | ZZ11 | Food product manufacturing | Percentage | H37 | Quick ratio | Financial ratios | 52 | ANZSIC06 groups C111, C112, C113, C114, C115, ... |
| 55616 | 2013 | Level 3 | ZZ11 | Food product manufacturing | Percentage | H38 | Margin on sales of goods for resale | Financial ratios | 40 | ANZSIC06 groups C111, C112, C113, C114, C115, ... |
| 55617 | 2013 | Level 3 | ZZ11 | Food product manufacturing | Percentage | H39 | Return on equity | Financial ratios | 12 | ANZSIC06 groups C111, C112, C113, C114, C115, ... |
| 55618 | 2013 | Level 3 | ZZ11 | Food product manufacturing | Percentage | H40 | Return on total assets | Financial ratios | 5 | ANZSIC06 groups C111, C112, C113, C114, C115, ... |
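
With a file this large, it can help to load only part of it while exploring. The sketch below is a self-contained illustration: instead of the real file, it reads a small made-up CSV from an in-memory string (`csv_text` and its shortened column names are assumptions) and uses the `usecols` and `nrows` parameters of `pd.read_csv` to limit what is loaded.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk
csv_text = """Year,Industry_name,Value
2024,All industries,979594
2024,All industries,838626
2013,Food product manufacturing,52
"""

df_csv = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Year", "Value"],  # load only these columns
    nrows=2,                    # read only the first 2 data rows
)
print(df_csv)
```

The same two parameters work when a file path is passed in place of the `StringIO` object.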


### From Excel File

In [14]:
pd.read_excel("Data Refresh Sample Data.xlsx")

|  | Ship Mode | Profit | Unit Price | Shipping Cost | Customer Name |
| --- | --- | --- | --- | --- | --- |
| 0 | Regular Air | -213.25 | 38.94 | 35.0 | Muhammed MacIntyre |
| 1 | Delivery Truck | 457.81 | 208.16 | 68.02 | Barry French |
| 2 | Regular Air | 46.7075 | 8.69 | 2.99 | Barry French |
| 3 | Regular Air | 1198.971 | 195.99 | 3.99 | Clay Rozendal |
| 4 | Regular Air | -4.715 | 5.28 | 2.99 | Claudia Miner |
| 5 | Regular Air | 782.91 | 39.89 | 3.04 | Neola Schneider |
| 6 | Regular Air | 93.8 | 15.74 | 1.39 | Allen Rosenblatt |
| 7 | Delivery Truck | 440.72 | 100.98 | 26.22 | Sylvia Foulston |
| 8 | Regular Air | -481.041 | 100.98 | 69.0 | Sylvia Foulston |
| 9 | Regular Air | -11.682 | 65.99 | 5.26 | Jim Radford |


### From .json File

In [15]:
pd.read_json("data.json")

|  | name | language | id | bio | version |
| --- | --- | --- | --- | --- | --- |
| 0 | Adeel Solangi | Sindhi | V59OF92YF627HFY0 | Donec lobortis eleifend condimentum. Cras dict... | 6.10 |
| 1 | Afzal Ghaffar | Sindhi | ENTOCR13RSCLZ6KU | Aliquam sollicitudin ante ligula, eget malesua... | 1.88 |
| 2 | Aamir Solangi | Sindhi | IAKPO3R4761JDRVG | Vestibulum pharetra libero et velit gravida eu... | 7.27 |
| 3 | Abla Dilmurat | Uyghur | 5ZVOEPMJUI4MB4EN | Donec lobortis eleifend condimentum. Morbi ac ... | 2.53 |
| 4 | Adil Eli | Uyghur | 6VTI8X6LL0MMPJCC | Vivamus id faucibus velit, id posuere leo. Mor... | 6.49 |
| ... | ... | ... | ... | ... | ... |
| 192 | Kristín Sigurðardóttir | Icelandic | ZP5TBBYX6RI2UJ31 | Cras dictum dolor lacinia lectus vehicula rutr... | 2.80 |
| 193 | Rohini Vasav | Hindi | UEFML43TCGS04KWM | Ut accumsan, est vel fringilla varius, purus a... | 9.30 |
| 194 | Sunil Kapoor | Hindi | VY2A0APGVHK5NAW2 | Proin tempus eu risus nec mattis. Ut dictum, l... | 8.04 |
| 195 | Zamokuhle Zulu | isiZulu | XU7BX2F8M5PVZ1EF | Etiam congue dignissim volutpat. Phasellus tin... | 8.39 |
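
The table above suggests that `data.json` holds a list of records, where each object becomes one row and each key becomes a column. That layout is an assumption, since the file itself is not shown. The sketch below reads a two-record string in that shape (`json_text` is a hypothetical stand-in for the file) to show how the mapping works.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of a JSON file:
# a list of objects, one object per row
json_text = """[
  {"name": "Adeel Solangi", "language": "Sindhi", "version": 6.1},
  {"name": "Abla Dilmurat", "language": "Uyghur", "version": 2.53}
]"""

df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```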


# EDA (Exploratory Data Analysis)
Exploratory Data Analysis (EDA) is an essential first step in any data science project.

It involves taking a deep look at the dataset to understand its structure, spot patterns, identify anomalies, and uncover relationships between variables. This process includes generating summary statistics, checking for missing or duplicate data, and creating visualizations like histograms, box plots, and scatter plots. The goal of EDA is to get a clear picture of what the data is telling you before applying any analysis or machine learning models.

By exploring the data thoroughly, you can make better decisions about how to clean, transform, and model it effectively.
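
The checks for missing and duplicate data mentioned above can be sketched in a few lines. The example uses a small made-up DataFrame (`df_demo` is a hypothetical name) that contains one missing value and one repeated row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age and one fully duplicated row
df_demo = pd.DataFrame({
    "Name": ["Himani", "Shubham", "Shubham", "Sakshi"],
    "Age": [23, 28, 28, np.nan],
})

# Number of missing values in each column
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# Number of rows that repeat an earlier row exactly
duplicate_rows = df_demo.duplicated().sum()
print("Duplicate rows:", duplicate_rows)
```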

Once your DataFrame is ready, run these to understand your data:
- df.head()         # First 5 rows (by default)
- df.tail()         # Last 5 rows (by default)
- df.info()         # Column names, dtypes, non-null counts, memory usage
- df.describe()     # Summary statistics for numeric columns
- df.columns        # Column labels
- df.shape          # (rows, columns)

In [17]:
df.head()

|  | Age | Roll.NO |
| --- | --- | --- |
| 0 | 23 | 28 |
| 1 | 78 | 92 |
| 2 | 56 | 80 |
| 3 | 82 | 13 |


In [18]:
df.tail()

|  | Age | Roll.NO |
| --- | --- | --- |
| 0 | 23 | 28 |
| 1 | 78 | 92 |
| 2 | 56 | 80 |
| 3 | 82 | 13 |
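
Both calls return the same four rows here because the DataFrame has fewer rows than the default of 5. Passing a row count makes the difference visible. The sketch below rebuilds the same data so it runs on its own; `first_two` and `last_two` are illustrative names.

```python
import pandas as pd

# Same values as the NumPy example, rebuilt so the snippet is self-contained
df = pd.DataFrame({"Age": [23, 78, 56, 82], "Roll.NO": [28, 92, 80, 13]})

first_two = df.head(2)  # rows with index 0 and 1
last_two = df.tail(2)   # rows with index 2 and 3
print(first_two)
print(last_two)

# With only 4 rows, the default of 5 returns the whole frame both times
print(df.head().equals(df.tail()))  # True
```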


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Age      4 non-null      int64
 1   Roll.NO  4 non-null      int64
dtypes: int64(2)
memory usage: 196.0 bytes


In [21]:
df.describe()

|  | Age | Roll.NO |
| --- | --- | --- |
| count | 4.0 | 4.0 |
| mean | 59.75 | 53.25 |
| std | 27.035471 | 38.621022 |
| min | 23.0 | 13.0 |
| 25% | 47.75 | 24.25 |
| 50% | 67.0 | 54.0 |
| 75% | 79.0 | 83.0 |
| max | 82.0 | 92.0 |
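
Two of these rows are worth checking by hand. The `std` row is the sample standard deviation (divisor n - 1), which differs from NumPy's default population version, and the percentile rows use linear interpolation between sorted values. The sketch below reproduces the `Age` figures under those assumptions; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

age = pd.Series([23, 78, 56, 82], name="Age")

# pandas uses ddof=1 (sample standard deviation) by default
sample_std = age.std()

# NumPy needs ddof=1 spelled out to match; its default is ddof=0
numpy_std = np.std(age.to_numpy(), ddof=1)
print(round(sample_std, 6), round(numpy_std, 6))

# Sorted values are 23, 56, 78, 82; the 25% point lies three quarters
# of the way from 23 to 56: 23 + 0.75 * (56 - 23) = 47.75
q25 = age.quantile(0.25)
print(q25)
```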


In [23]:
df.columns

Index(['Age', 'Roll.NO'], dtype='object')

In [22]:
df.shape

(4, 2)