NumPy Questions

An overview of the **Pandas** library in Python:

---

## 🐼 **Pandas – Python Library for Data Analysis**

### 🔹 What is Pandas?

**Pandas** is an open-source Python library providing fast, flexible, and expressive data structures designed to work with structured (tabular), semi-structured, and time-series data.

It is especially useful for:

* Cleaning and preparing data
* Analyzing large datasets
* Converting data formats
* Performing data wrangling

---

### 🔹 Key Data Structures in Pandas:

| Structure   | Description                                                                 |
| ----------- | --------------------------------------------------------------------------- |
| `Series`    | One-dimensional labeled array (like a column in Excel)                      |
| `DataFrame` | Two-dimensional labeled data structure (like a table with rows and columns) |
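
The two structures in the table can be sketched as follows. The names and values are invented for illustration only.

```python
import pandas as pd

# A Series: one-dimensional values, each addressed by an index label
ages = pd.Series([24, 13, 53], index=["John", "Anna", "Peter"], name="age")

# A DataFrame: a table whose columns are Series that share one index
people = pd.DataFrame(
    {"age": [24, 13, 53], "city": ["New York", "Paris", "Berlin"]},
    index=["John", "Anna", "Peter"],
)

print(ages["Anna"])    # look up a single value by its label
print(people["city"])  # selecting one column returns a Series
print(people.shape)    # (number of rows, number of columns)
```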

---

### 🔹 Why Use Pandas?

* Easy handling of **missing data**
* Powerful **grouping** and **aggregation**
* High-performance **merging** and **joining**
* Built-in support for **time-series** data
* Easy **reading and writing** to CSV, Excel, SQL, JSON, etc.

---

### 🔹 Common Pandas Functions:

| Function                             | Purpose                             |
| ------------------------------------ | ----------------------------------- |
| `read_csv()`                         | Load data from a CSV file           |
| `head()` / `tail()`                  | View top/bottom rows of a DataFrame |
| `info()` / `describe()`              | Get data summary and statistics     |
| `isnull()` / `dropna()` / `fillna()` | Handle missing data                 |
| `groupby()`                          | Group and aggregate data            |
| `merge()` / `join()` / `concat()`    | Combine DataFrames                  |
| `to_csv()`                           | Export data to a CSV file           |
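
As a rough illustration of the combining functions in the table, the sketch below joins and stacks two small frames. The frames, column names, and values are hypothetical.

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["John", "Anna", "Peter"]})
salaries = pd.DataFrame({"emp_id": [1, 2, 4], "salary": [50000, 62000, 48000]})

# merge(): SQL-style join on a key column; how="inner" keeps only keys present in both
inner = pd.merge(employees, salaries, on="emp_id", how="inner")

# how="left" keeps every row of the left frame; a missing salary becomes NaN
left = pd.merge(employees, salaries, on="emp_id", how="left")

# concat(): stack frames that share the same columns on top of each other
more_employees = pd.DataFrame({"emp_id": [5], "name": ["Linda"]})
stacked = pd.concat([employees, more_employees], ignore_index=True)

print(inner)
print(left)
print(stacked)
```

`merge()` matches rows by key values, while `concat()` simply appends rows (or columns) without matching anything.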

---

### 🔹 Example Code:

```python
import pandas as pd

# Load CSV
df = pd.read_csv('data.csv')

# Show first 5 rows
print(df.head())

# Get basic info (info() prints its report itself and returns None)
df.info()

# Filter data
filtered = df[df['age'] > 25]

# Group by and mean
grouped = df.groupby('department')['salary'].mean()
```
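
The example above needs a `data.csv` file on disk. A self-contained variant, assuming a small in-memory table with invented `department`, `age`, and `salary` values, might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["HR", "IT", "IT", "HR", "Sales"],
    "age": [24, 31, 45, 52, 29],
    "salary": [40000, 60000, 80000, 50000, 45000],
})

# Keep only the rows where the condition is True
filtered = df[df["age"] > 25]

# Several aggregations per group in a single call
summary = df.groupby("department")["salary"].agg(["mean", "max", "count"])

print(filtered)
print(summary)
```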

---

### 🔹 Real-World Use Cases:

* Analyzing sales data
* Preprocessing datasets for machine learning
* Financial data analysis
* Automating reports and dashboards

---



In [None]:
import pandas as pd 
data = {'name':['John', 'Anna', 'Peter', 'Linda'],
        'age':[24, 13, 53, 33],
        'city':['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
print(df)

#Inspect DataFrame
print(df.head(2))
print(df.describe())

    name  age      city
0   John   24  New York
1   Anna   13     Paris
2  Peter   53    Berlin
3  Linda   33    London
   name  age      city
0  John   24  New York
1  Anna   13     Paris
             age
count   4.000000
mean   30.750000
std    16.938615
min    13.000000
25%    21.250000
50%    28.500000
75%    38.000000
max    53.000000
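
The `describe()` output above covers only `age` because, by default, it summarises numeric columns. A small sketch of including the text columns as well, reusing the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Anna", "Peter", "Linda"],
    "age": [24, 13, 53, 33],
    "city": ["New York", "Paris", "Berlin", "London"],
})

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # adds unique/top/freq rows for text columns

print(numeric_summary)
print(full_summary)
```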


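Loading and Inspecting a Dataset:

pd.read_csv(filepath) reads a CSV file (local path or URL) into a DataFrame; pass sep when the delimiter is not a comma.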

In [None]:
# Load the dataset (fields in this file are separated by '|')
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|')
print(data.head(10))
# Count missing values in each column
print(data.isnull().sum())

   user_id  age gender     occupation zip_code
0        1   24      M     technician    85711
1        2   53      F          other    94043
2        3   23      M         writer    32067
3        4   24      M     technician    43537
4        5   33      F          other    15213
5        6   42      M      executive    98101
6        7   57      M  administrator    91344
7        8   36      M  administrator    05201
8        9   29      M        student    01002
9       10   53      M         lawyer    90703
user_id       0
age           0
gender        0
occupation    0
zip_code      0
dtype: int64
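
Zip codes such as `05201` keep their leading zero only when the column is read as text. The sketch below shows the effect of the `dtype` argument on a tiny in-memory sample; the sample string is made up and stands in for a real file.

```python
import io
import pandas as pd

raw = "user_id|age|zip_code\n1|24|85711\n2|53|05201\n"

# Without a dtype hint, an all-digit column is parsed as integers
as_numbers = pd.read_csv(io.StringIO(raw), sep="|")

# Forcing the column to str preserves the text exactly as written
as_text = pd.read_csv(io.StringIO(raw), sep="|", dtype={"zip_code": str})

print(as_numbers["zip_code"].tolist())
print(as_text["zip_code"].tolist())
```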


Missing Values

In [None]:
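# Fill missing values in the whole DataFrame with 0
df.fillna(value=0, inplace=True)
print(df)

# Fill missing values in a single column ('column name' is a placeholder)
df['column name'] = df['column name'].fillna(value=0)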
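
Filling everything with 0 is not always appropriate. A minimal sketch of per-column strategies, using an invented four-row frame (the fill values chosen here are illustrative, not a recommendation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "service": [49.0, 6.0, np.nan, 31.0],
    "rank": ["Prof", None, "Prof", "AssocProf"],
})

print(df.isnull().sum())  # missing values per column

# Numeric column: fill with the column mean (mean() skips NaN)
df["service"] = df["service"].fillna(df["service"].mean())

# Text column: fill with an explicit placeholder
df["rank"] = df["rank"].fillna("Unknown")

# Alternative: drop every row that still contains a missing value
complete_rows = df.dropna()
print(complete_rows)
```

Assigning the result back to the column avoids relying on `inplace=True` for a single column.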



In [8]:
import pandas as pd

In [9]:
df=pd.read_csv("https://raw.githubusercontent.com/lovnishverma/datasets/refs/heads/main/testdata.csv")
df.columns

Index(['rank', 'discipline', 'phd', 'service', 'sex', 'salary'], dtype='object')

In [10]:
df.describe()

             phd    service         salary
count  80.000000  79.000000      79.000000
mean   19.450000  14.860759  107315.886076
std    12.516217  12.179990   27596.271520
min     1.000000   0.000000   57800.000000
25%    10.000000   4.500000   88000.000000
50%    18.000000  14.000000  104542.000000
75%    27.250000  20.500000  126300.000000
max    56.000000  51.000000  186960.000000


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   rank        79 non-null     object 
 1   discipline  79 non-null     object 
 2   phd         80 non-null     int64  
 3   service     79 non-null     float64
 4   sex         79 non-null     object 
 5   salary      79 non-null     float64
dtypes: float64(2), int64(1), object(3)
memory usage: 3.9+ KB


In [12]:
df.dtypes

rank           object
discipline     object
phd             int64
service       float64
sex            object
salary        float64
dtype: object

In [13]:
print("\n First few rows:")
print(df.head())

print("\n Last few rows:")
print(df.tail())

print("\n Shape of DataFrame (rows, columns):")
print(df.shape)

print("\n Column names:")
print(df.columns)

print("\n Data types:")
print(df.dtypes)

df.info()              # Data types & missing values (prints directly, returns None)
print(df.describe())   # Summary statistics for numeric columns


 First few rows:
   rank discipline  phd  service   sex    salary
0  Prof          B   56     49.0  Male  186960.0
1  Prof          A   12      6.0  Male   93000.0
2   NaN          A   23     20.0  Male  110515.0
3  Prof          A   40     31.0   NaN  131205.0
4  Prof          B   20      NaN  Male  104800.0

 Last few rows:
         rank discipline  phd  service     sex    salary
75       Prof          B   18     10.0  Female  105450.0
76  AssocProf          B   19      6.0  Female  104542.0
77       Prof          B   17     17.0  Female  124312.0
78       Prof          A   28     14.0  Female  109954.0
79       Prof          A   23     15.0  Female  109646.0

 Shape of DataFrame (rows, columns):
(80, 6)

 Column names:
Index(['rank', 'discipline', 'phd', 'service', 'sex', 'salary'], dtype='object')

 Data types:
rank           object
discipline     object
phd             int64
service       float64
sex            object
salary        float64
dtype: object
<class 'pandas.core.frame.

iloc: df.iloc[rows, columns] selects rows and columns by integer position

In [14]:
df.head(10)

        rank discipline  phd  service   sex    salary
0       Prof          B   56     49.0  Male  186960.0
1       Prof          A   12      6.0  Male   93000.0
2        NaN          A   23     20.0  Male  110515.0
3       Prof          A   40     31.0   NaN  131205.0
4       Prof          B   20      NaN  Male  104800.0
5       Prof          A   20     20.0  Male  122400.0
6  AssocProf          A   20     17.0  Male   81285.0
7       Prof          A   18     18.0  Male  126300.0
8       Prof          A   18     18.0  Male  126300.0
9       Prof          A   29     19.0  Male   94350.0


In [16]:
df.head()

   rank discipline  phd  service   sex    salary
0  Prof          B   56     49.0  Male  186960.0
1  Prof          A   12      6.0  Male   93000.0
2   NaN          A   23     20.0  Male  110515.0
3  Prof          A   40     31.0   NaN  131205.0
4  Prof          B   20      NaN  Male  104800.0


Slicing

In [None]:
# df.iloc[:, :] selects all rows and all columns
# print(df[1:10:2])  # row slice as start:stop:step (rows 1, 3, 5, 7, 9)


In [None]:
df.iloc[1,1] #Rows , Columns

'A'

In [18]:
df.iloc[0:3,1] #Rows , Columns

0    B
1    A
2    A
Name: discipline, dtype: object

In [19]:
df.iloc[:,0:5]

         rank discipline  phd  service     sex
0        Prof          B   56     49.0    Male
1        Prof          A   12      6.0    Male
2         NaN          A   23     20.0    Male
3        Prof          A   40     31.0     NaN
4        Prof          B   20      NaN    Male
..        ...        ...  ...      ...     ...
75       Prof          B   18     10.0  Female
76  AssocProf          B   19      6.0  Female
77       Prof          B   17     17.0  Female
78       Prof          A   28     14.0  Female


In [21]:
df.iloc[:,0:-2]

         rank discipline  phd  service
0        Prof          B   56     49.0
1        Prof          A   12      6.0
2         NaN          A   23     20.0
3        Prof          A   40     31.0
4        Prof          B   20      NaN
..        ...        ...  ...      ...
75       Prof          B   18     10.0
76  AssocProf          B   19      6.0
77       Prof          B   17     17.0
78       Prof          A   28     14.0


In [25]:
df.loc[:,'phd'] #all rows, and col phd


0     56
1     12
2     23
3     40
4     20
      ..
75    18
76    19
77    17
78    28
79    23
Name: phd, Length: 80, dtype: int64

In [26]:
df.loc[0:2,['rank','phd']]

   rank  phd
0  Prof   56
1  Prof   12
2   NaN   23
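
Note that `df.loc[0:2, ...]` returned three rows, whereas `df.iloc[0:3, 1]` needed a stop of 3 to do the same. Label slices with `loc` include the stop label, while position slices with `iloc` exclude the stop position. A sketch with a made-up frame whose index labels differ from the positions makes this visible:

```python
import pandas as pd

df = pd.DataFrame(
    {"rank": ["Prof", "Prof", "AssocProf", "AsstProf"], "phd": [56, 12, 23, 40]},
    index=[10, 20, 30, 40],
)

by_position = df.iloc[0:2]  # positions 0 and 1; the stop position is excluded
by_label = df.loc[10:30]    # labels 10, 20 and 30; the stop label is included
cell = df.loc[20, "phd"]    # single value at row label 20, column "phd"

print(by_position)
print(by_label)
print(cell)
```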


In [27]:
print(df['rank'])

0          Prof
1          Prof
2           NaN
3          Prof
4          Prof
        ...    
75         Prof
76    AssocProf
77         Prof
78         Prof
79         Prof
Name: rank, Length: 80, dtype: object


In [28]:
print(df[['rank','phd']])

         rank  phd
0        Prof   56
1        Prof   12
2         NaN   23
3        Prof   40
4        Prof   20
..        ...  ...
75       Prof   18
76  AssocProf   19
77       Prof   17
78       Prof   28
79       Prof   23

[80 rows x 2 columns]


In [29]:
df[df['service']>40]

    rank discipline  phd  service   sex    salary
0   Prof          B   56     49.0  Male  186960.0
10  Prof          A   51     51.0  Male   57800.0
29  Prof          A   45     43.0  Male  155865.0
38  Prof          B   45     45.0  Male  146856.0
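
Filters like the one above can be combined. The following sketch uses a small invented frame with the same column names; the thresholds are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "rank": ["Prof", "Prof", "AssocProf", "AsstProf"],
    "service": [49.0, 6.0, 17.0, 0.0],
    "salary": [186960.0, 93000.0, 81285.0, 88000.0],
})

# & means AND, | means OR; each condition needs its own parentheses
long_serving_profs = df[(df["rank"] == "Prof") & (df["service"] > 40)]
junior_or_low_paid = df[(df["service"] < 5) | (df["salary"] < 85000)]

# isin() tests membership in a list of values
non_full_profs = df[df["rank"].isin(["AssocProf", "AsstProf"])]

print(long_serving_profs)
print(junior_or_low_paid)
print(non_full_profs)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why `&` and `|` are used here.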


In [None]:
print(df.sort_values(by='service'))  # ascending order by default; NaN values are placed last

        rank discipline  phd  service     sex    salary
13  AsstProf          B    1      0.0    Male   88000.0
14  AsstProf          B    1      0.0    Male   88000.0
25  AsstProf          A    2      0.0    Male   85000.0
19  AsstProf          B    4      0.0    Male   92000.0
54      Prof          A   12      0.0  Female  105000.0
..       ...        ...  ...      ...     ...       ...
29      Prof          A   45     43.0    Male  155865.0
38      Prof          B   45     45.0    Male  146856.0
0       Prof          B   56     49.0    Male  186960.0
10      Prof          A   51     51.0    Male   57800.0
4       Prof          B   20      NaN    Male  104800.0

[80 rows x 6 columns]
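
The sort direction and the placement of missing values are both configurable. A minimal sketch on an invented four-row frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", "Prof"],
    "service": [49.0, np.nan, 0.0, 51.0],
})

ascending = df.sort_values(by="service")                       # NaN goes last
descending = df.sort_values(by="service", ascending=False)     # NaN still goes last
nan_first = df.sort_values(by="service", na_position="first")  # NaN goes first

print(ascending)
print(descending)
print(nan_first)
```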


Create a new column from an existing one

In [32]:
df['is_senior'] = df['rank'] == 'Prof'  # boolean column; the comparison is case-sensitive, so match the stored value 'Prof'
print(df[['rank','is_senior']])

         rank  is_senior
0        Prof       True
1        Prof       True
2         NaN      False
3        Prof       True
4        Prof       True
..        ...        ...
75       Prof       True
76  AssocProf      False
77       Prof       True
78       Prof       True
79       Prof       True

[80 rows x 2 columns]
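
Because `==` on strings is case-sensitive, a comparison against `'prof'` matches nothing when the data stores `Prof`. One way to guard against inconsistent capitalisation is to normalise the case first. The sketch below assumes a small invented frame; `salary_k` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rank": ["Prof", "prof", np.nan, "AssocProf"],
    "salary": [186960.0, 93000.0, 110515.0, 81285.0],
})

# Lower-case before comparing; a missing rank compares as not equal, giving False
df["is_senior"] = df["rank"].str.lower() == "prof"

# Arithmetic on an existing column also creates a new column
df["salary_k"] = df["salary"] / 1000

print(df)
```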


In [43]:
df.rename(columns={'rank':'Rank','service':'Service'}, inplace=True)  # inplace=True modifies df directly instead of returning a copy
print(df.columns)

Index(['Rank', 'discipline', 'phd', 'Service', 'sex', 'salary', 'is_senior'], dtype='object')


In [39]:
df.duplicated().sum()

np.int64(2)

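In [None]:
# Count fully duplicated rows
print(df.duplicated().sum())

# drop_duplicates() returns a new DataFrame, so keep the result
cleaned = df.drop_duplicates()

print(cleaned.duplicated().sum())  # no duplicates remain in the copy
cleaned.to_csv("Cleaned Csv", index=False)  # save the cleaned data without the index column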
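
`drop_duplicates()` leaves the original DataFrame unchanged and returns a new one. A short sketch on an invented frame, also showing the `subset` and `keep` parameters:

```python
import pandas as pd

df = pd.DataFrame({
    "rank": ["Prof", "Prof", "Prof", "AssocProf"],
    "phd": [18, 18, 20, 20],
    "salary": [126300.0, 126300.0, 122400.0, 81285.0],
})

n_duplicates = df.duplicated().sum()  # rows identical to an earlier row
deduplicated = df.drop_duplicates()   # new frame; df itself is unchanged

# Compare only the "phd" column and keep the last row of each group
last_per_phd = df.drop_duplicates(subset=["phd"], keep="last")

print(n_duplicates)
print(deduplicated)
print(last_per_phd)
```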

In [47]:
print(df.shape)

(80, 7)
