
---

### 📘 **Creating DataFrames in Pandas**

A **DataFrame** is the core data structure in Pandas: a two-dimensional, labeled table of rows and columns, used throughout data science for cleaning, transforming, and analyzing data.

---

### 🧱 1. **From Python Lists**

You can create a DataFrame from a list of lists, where each inner list corresponds to one row.
Pass column names through the `columns` argument; otherwise Pandas labels the columns 0, 1, 2, and so on.

---

### 📚 2. **From a Dictionary of Lists**

This is one of the most common and readable approaches: each dictionary key becomes a column name, and its list supplies the values for that column.
All lists must have the same length.

---

### 🔢 3. **From NumPy Arrays**

If you're working with numerical data, you can convert NumPy arrays into DataFrames.
Always provide column names to interpret the data meaningfully.
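
As a minimal sketch of the conversion described above, the snippet below wraps a small 2-D array in a DataFrame. The array values and the column names `height` and `width` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 array of measurements; the values are illustrative only.
arr = np.array([[1.5, 2.0], [3.5, 4.0], [5.5, 6.0]])

# Without `columns`, pandas would label the two columns 0 and 1.
df_np = pd.DataFrame(arr, columns=["height", "width"])
print(df_np)
```

Each row of the array becomes a row of the DataFrame, and the array's dtype (`float64` here) carries over to the columns.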

---

### 📂 4. **From CSV Files**

Pandas can directly read CSV files, which are commonly used for tabular data.
Useful parameters include:

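* `sep` – delimiter between fields (a comma by default)
* `header` – row number that holds the column names (`None` if the file has no header row)
* `names` – custom column names
* `index_col` – column(s) to use as the row index
* `usecols` – subset of columns to read
* `nrows` – number of rows to read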
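
The parameters above can be combined in a single call. The sketch below reads from an in-memory string instead of a file so that it runs anywhere; the data, the semicolon delimiter, and the column names are all assumptions chosen for illustration.

```python
import io
import pandas as pd

# Made-up semicolon-delimited text with no header row, standing in for a file.
raw = io.StringIO(
    "1;Asha;81;Pune\n"
    "2;Ravi;74;Delhi\n"
    "3;Meera;90;Mumbai\n"
)

df_csv = pd.read_csv(
    raw,
    sep=";",                                            # fields are split on semicolons
    header=None,                                        # the text has no header row
    names=["student_id", "student", "score", "city"],   # supply column names ourselves
    index_col="student_id",                             # use this column as the row index
    usecols=["student_id", "student", "score"],         # skip the `city` column
    nrows=2,                                            # read only the first two rows
)
print(df_csv)
```

With a real file, the `io.StringIO` object would simply be replaced by the file path.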

---

### 📊 5. **From Excel Files**

Excel files can be imported with `pd.read_excel`.
It relies on a separate engine library: `openpyxl` for `.xlsx` files and `xlrd` for legacy `.xls` files.

---

### 🌐 6. **From JSON**

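Data from JSON files, strings, or URLs can be converted into DataFrames with `pd.read_json`.
Flat records map directly to rows and columns; nested objects are not expanded automatically and usually need flattening first, for example with `pd.json_normalize`.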
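
A small sketch of loading flat JSON records from a string rather than a file. The records are invented for illustration, and the string is wrapped in `io.StringIO` because recent Pandas versions deprecate passing literal JSON text directly.

```python
import io
import pandas as pd

# Hypothetical JSON text: a list of flat records with identical keys.
json_text = '[{"name": "Asha", "lang": "python"}, {"name": "Ravi", "lang": "java"}]'

# Each record becomes a row; each key becomes a column.
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```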

---

### 🗄️ 7. **From SQL Databases**

Pandas can connect to SQL databases and load query results into a DataFrame.
Because filtering and aggregation can run inside the database, only the rows you actually need are pulled into memory, which helps with large relational datasets.
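
To illustrate, here is a sketch that uses an in-memory SQLite database from Python's standard library, so no database server is needed. The `students` table and its rows are hypothetical; a real project would connect to its own database instead.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database with a hypothetical `students` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Asha", 81), ("Ravi", 74), ("Meera", 90)],
)
conn.commit()

# The WHERE clause runs in the database, so only matching rows reach pandas.
df_sql = pd.read_sql_query(
    "SELECT name, marks FROM students WHERE marks > ? ORDER BY marks",
    conn,
    params=(75,),
)
conn.close()
print(df_sql)
```

Passing the threshold through `params` rather than formatting it into the query string keeps the query safe from SQL injection.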

---

### 🌍 8. **From the Web**

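Reader functions such as `pd.read_csv` accept a URL in place of a local path, so files hosted online can be loaded directly.
This is handy for public datasets, though it requires network access and depends on the remote file staying available.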

---

### 🔍 **Exploratory Data Analysis (EDA)**

EDA is the step of investigating and summarizing a dataset before building models.
It helps identify patterns, detect issues, and understand relationships.

**Common EDA steps include:**

* Generating descriptive statistics
* Checking data types and null values
* Finding duplicates and outliers
* Creating visualizations like histograms, box plots, scatter plots
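
The non-visual checks in this list can be sketched on a toy table. The data below is invented and deliberately contains one missing value, one duplicate row, and one extreme value. The 1.5 × IQR rule used for outliers is one common convention, not the only option.

```python
import pandas as pd

# Made-up data with a missing mark, a repeated row, and an extreme value.
df_eda = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "Ravi", "Kabir", "Zoya"],
    "marks": [81.0, 74.0, None, 74.0, 79.0, 300.0],
})

print(df_eda.dtypes)                      # data type of each column
null_counts = df_eda.isna().sum()         # missing values per column
dup_count = df_eda.duplicated().sum()     # rows identical to an earlier row

# Flag values more than 1.5 * IQR outside the quartiles (missing values are ignored).
q1 = df_eda["marks"].quantile(0.25)
q3 = df_eda["marks"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df_eda["marks"] < q1 - 1.5 * iqr) | (df_eda["marks"] > q3 + 1.5 * iqr)
outliers = df_eda[is_outlier]

print(null_counts)
print("duplicate rows:", dup_count)
print(outliers)
```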

---

### 🧭 **Essential EDA Commands (Theory)**

| Purpose         | Description                                                      |
| --------------- | ---------------------------------------------------------------- |
| View Structure  | Understand number of rows, columns, and data types (`df.info()`) |
| Summary Stats   | Get statistics for numeric columns (`df.describe()`)             |
| Column Overview | View column names (`df.columns`)                                 |
| Quick Look      | Preview data (`df.head()`, `df.tail()`)                          |

---

### ✅ **Summary**

* **DataFrames** can be created from many sources: lists, dictionaries, NumPy arrays, files (CSV/Excel), databases, JSON, or even web links.
* **EDA** is crucial for cleaning and understanding your data before applying any analysis or machine learning models.

---



In [1]:
import pandas as pd

data = [["Mudabbir",22],["Omkaer",21],["Joy",34]]

data

[['Mudabbir', 22], ['Omkaer', 21], ['Joy', 34]]

In [2]:
pd.DataFrame(data, columns=["Name","Marks"])

|   | Name     | Marks |
| - | -------- | ----- |
| 0 | Mudabbir | 22    |
| 1 | Omkaer   | 21    |
| 2 | Joy      | 34    |


**From Dictionary of Lists**

In [6]:
data = { "A":[1,2,33],"B":[23,44,3]}
df = pd.DataFrame(data)
df

|   | A  | B  |
| - | -- | -- |
| 0 | 1  | 23 |
| 1 | 2  | 44 |
| 2 | 33 | 3  |


In [11]:
pd.read_csv("data.csv")

|   | Actor              | Film              | Year | Genre         | BoxOffice(INR Crore) | IMDb |
| - | ------------------ | ----------------- | ---- | ------------- | -------------------- | ---- |
| 0 | Shah Rukh Khan     | Pathaan           | 2023 | Action        | 1050                 | 7.2  |
| 1 | Salman Khan        | Tiger Zinda Hai   | 2017 | Action        | 565                  | 6.0  |
| 2 | Aamir Khan         | Dangal            | 2016 | Biography     | 2024                 | 8.4  |
| 3 | Ranbir Kapoor      | Brahmastra        | 2022 | Fantasy       | 431                  | 5.6  |
| 4 | Ranveer Singh      | Padmaavat         | 2018 | Historical    | 585                  | 7.0  |
| 5 | Ayushmann Khurrana | Andhadhun         | 2018 | Thriller      | 111                  | 8.3  |
| 6 | Rajkummar Rao      | Stree             | 2018 | Horror Comedy | 180                  | 7.5  |
| 7 | Hrithik Roshan     | War               | 2019 | Action        | 475                  | 6.5  |
| 8 | Akshay Kumar       | Good Newwz        | 2019 | Comedy        | 318                  | 7.0  |
| 9 | Kartik Aaryan      | Bhool Bhulaiyaa 2 | 2022 | Horror Comedy | 266                  | 5.9  |


In [12]:
pd.read_json("data.json")

|   | name     | lang   |
| - | -------- | ------ |
| 0 | Mudabbir | python |
| 1 | Esha     | python |
| 2 | OM       | JAVA   |


In [14]:
pd.read_json("codebook_data.json")

|   | users                                              | pages                                           |
| - | -------------------------------------------------- | ----------------------------------------------- |
| 0 | {'id': 1, 'name': 'Amit', 'friends': [2, 3], '...  | {'id': 101, 'name': 'Python Developers'}        |
| 1 | {'id': 2, 'name': 'Priya', 'friends': [1, 4], ...  | {'id': 102, 'name': 'Data Science Enthusiasts'} |
| 2 | {'id': 3, 'name': 'Rahul', 'friends': [1], 'li...  | {'id': 103, 'name': 'AI & ML Community'}        |
| 3 | {'id': 4, 'name': 'Sara', 'friends': [2], 'lik...  | {'id': 104, 'name': 'Web Dev Hub'}              |
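
In the output above, each cell holds a whole dictionary, because `read_json` does not expand nested objects. One way to get proper columns is `pd.json_normalize`, sketched below on a small hypothetical payload that only mimics the visible part of the structure. The real file's full contents are not shown here, so treat the keys as assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the parsed contents of a nested JSON file.
payload = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3]},
        {"id": 2, "name": "Priya", "friends": [1]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
    ],
}

# Flatten each top-level list into its own table, one column per key.
users = pd.json_normalize(payload["users"])
pages = pd.json_normalize(payload["pages"])
print(users)
print(pages)
```

With a file on disk, the payload would typically come from `json.load` on the opened file before being passed to `pd.json_normalize`. List-valued fields such as `friends` stay as Python lists inside a single column.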


In [15]:
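# Note: iris.data has no header row, so read_csv treats the first record as column names
# (which is why 149 rows appear below instead of 150); pass header=None to avoid this.
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")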
df

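|     | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa    |
| --- | --- | --- | --- | --- | -------------- |
| 0   | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa    |
| 1   | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa    |
| 2   | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa    |
| 3   | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa    |
| 4   | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa    |
| ... | ... | ... | ... | ... | ...            |
| 144 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 145 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 146 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 147 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |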


In [17]:
df.head()

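|   | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| - | --- | --- | --- | --- | ----------- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |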


In [18]:
df.tail()

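|     | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa    |
| --- | --- | --- | --- | --- | -------------- |
| 144 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 145 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 146 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 147 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 148 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |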


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   5.1          149 non-null    float64
 1   3.5          149 non-null    float64
 2   1.4          149 non-null    float64
 3   0.2          149 non-null    float64
 4   Iris-setosa  149 non-null    object 
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


In [20]:
df.describe()


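|       | 5.1      | 3.5      | 1.4      | 0.2      |
| ----- | -------- | -------- | -------- | -------- |
| count | 149.0    | 149.0    | 149.0    | 149.0    |
| mean  | 5.848322 | 3.051007 | 3.774497 | 1.205369 |
| std   | 0.828594 | 0.433499 | 1.759651 | 0.761292 |
| min   | 4.3      | 2.0      | 1.0      | 0.1      |
| 25%   | 5.1      | 2.8      | 1.6      | 0.3      |
| 50%   | 5.8      | 3.0      | 4.4      | 1.3      |
| 75%   | 6.4      | 3.3      | 5.1      | 1.8      |
| max   | 7.9      | 4.4      | 6.9      | 2.5      |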


In [22]:
df.shape

(149, 5)

In [23]:
df.columns

Index(['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'], dtype='object')
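
The column labels above (`5.1`, `3.5`, and so on) are really the first data record, promoted to a header because the file has none. A possible fix is sketched below on a three-line in-memory sample in the same layout, so it runs without network access. The column names are chosen here for readability and are not part of the file.

```python
import io
import pandas as pd

# Three records in the same layout as iris.data (no header line).
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# Column names picked for this example; the file itself does not contain them.
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# header=None keeps the first record as data; names supplies the labels.
iris = pd.read_csv(sample, header=None, names=cols)
print(iris.shape)
print(iris.head())
```

Applied to the full file, the same two arguments should keep all 150 records as data rather than losing the first one to the header.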