Create a data frame with at least 3 columns and 50 rows to store numeric data generated using a random
function. Replace 10% of the values with null values at index positions generated using a random function.

Do the following:
a. Identify and count missing values in a data frame.
b. Drop the column having more than 5 null values.
c. Identify the label of the row with the largest sum of values and drop that row.
d. Sort the data frame on the basis of the first column.
e. Remove all duplicates from the first column.
f. Find the correlation between the first and second columns and the covariance between the second and third
columns.
g. Discretize the second column and create 5 bins.

In [25]:
import pandas as pd
import numpy as np

# Create a data frame with 4 columns and 50 rows of random integers
data = {'Column1': np.random.randint(1,30,50),
        'Column2': np.random.randint(50,100,50),
        'Column3': np.random.randint(100,150,50),
        'Column4': np.random.randint(150,175,50)}

df = pd.DataFrame(data)
df

Unnamed: 0,Column1,Column2,Column3,Column4
0,18,62,107,155
1,10,70,100,170
2,21,57,143,171
3,4,95,104,157
4,12,89,102,156
5,27,95,100,170
6,8,56,103,159
7,2,72,139,155
8,22,93,134,163
9,13,81,131,153


In [26]:
# Replace roughly 10% of the values with nulls: each cell is masked independently with probability 0.1
mask = np.random.rand(*df.shape) < 0.1
df[mask] = np.nan
df

Unnamed: 0,Column1,Column2,Column3,Column4
0,18.0,,107.0,155.0
1,10.0,70.0,100.0,170.0
2,21.0,57.0,143.0,171.0
3,4.0,95.0,104.0,157.0
4,12.0,89.0,102.0,156.0
5,27.0,95.0,100.0,170.0
6,8.0,,103.0,159.0
7,2.0,72.0,139.0,155.0
8,22.0,93.0,134.0,
9,13.0,81.0,131.0,
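
The mask above nulls each cell independently, so the share of missing values only averages 10%. A minimal sketch of a variant that nulls exactly 10% of the cells at randomly drawn index positions follows; the frame `demo`, the seed, and the helper names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only to make the sketch reproducible
demo = pd.DataFrame(rng.integers(1, 100, size=(50, 4)),
                    columns=['Column1', 'Column2', 'Column3', 'Column4']).astype(float)

n_null = int(round(0.10 * demo.size))  # 10% of the 200 cells
# Draw distinct flat cell positions, then convert them to (row, column) positions
flat_positions = rng.choice(demo.size, size=n_null, replace=False)
rows, cols = np.unravel_index(flat_positions, demo.shape)

values = demo.to_numpy(copy=True)
values[rows, cols] = np.nan  # null out exactly the sampled cells
demo = pd.DataFrame(values, index=demo.index, columns=demo.columns)

print(demo.isna().sum().sum())  # total number of nulls
```

Sampling without replacement guarantees that no cell is picked twice, which is what keeps the count exact.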


In [27]:
# a. Identify and count missing values
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)

Missing Values:
 Column1    3
Column2    7
Column3    4
Column4    6
dtype: int64
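
Part (a) also asks to identify the missing values, not only count them. The sketch below, on a small made-up frame `demo`, shows one possible way to list the row labels and the (row, column) labels of the missing cells alongside the counts.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the notebook's data)
demo = pd.DataFrame({'Column1': [1.0, np.nan, 3.0, 4.0],
                     'Column2': [np.nan, 5.0, np.nan, 8.0]})

per_column = demo.isna().sum()          # missing count per column
total_missing = int(per_column.sum())   # missing count for the whole frame

# Row labels that contain at least one missing value
rows_with_nan = demo.index[demo.isna().any(axis=1)].tolist()

# (row label, column label) of every missing cell, in row-major order
r_pos, c_pos = np.where(demo.isna().to_numpy())
positions = [(demo.index[r], demo.columns[c]) for r, c in zip(r_pos, c_pos)]

print(per_column)
print(total_missing, rows_with_nan, positions)
```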


In [28]:
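# b. Drop the column having more than 5 null values
# Note: thresh counts non-null values, so thresh=5 drops nothing here;
# thresh=len(df) - 5 would drop columns with more than 5 nulls.
df = df.dropna(thresh=5, axis=1)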
df

Unnamed: 0,Column1,Column2,Column3,Column4
0,18.0,,107.0,155.0
1,10.0,70.0,100.0,170.0
2,21.0,57.0,143.0,171.0
3,4.0,95.0,104.0,157.0
4,12.0,89.0,102.0,156.0
5,27.0,95.0,100.0,170.0
6,8.0,,103.0,159.0
7,2.0,72.0,139.0,155.0
8,22.0,93.0,134.0,
9,13.0,81.0,131.0,
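
In `dropna`, `thresh` is the minimum number of non-null values a column needs in order to be kept, which is why all four columns survive the call above. A minimal sketch of dropping columns with more than 5 nulls, on a made-up 10-row frame `demo`, might look like this:

```python
import numpy as np
import pandas as pd

# Illustrative frame: column B has 6 nulls, column C has exactly 5
demo = pd.DataFrame({'A': [1.0] * 10,
                     'B': [np.nan] * 6 + [1.0] * 4,
                     'C': [np.nan] * 5 + [1.0] * 5})

null_counts = demo.isna().sum()
kept = demo.loc[:, null_counts <= 5]  # keep columns with at most 5 nulls

# Same result with dropna: require at least len(demo) - 5 non-null values
kept_via_thresh = demo.dropna(axis=1, thresh=len(demo) - 5)

print(list(kept.columns))
print(list(kept_via_thresh.columns))
```

The boolean-mask form states the rule directly in terms of null counts, while the `thresh` form needs the row count to translate "more than 5 nulls" into a non-null minimum.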


In [21]:
# c. Identify the row label having the maximum sum and drop that row

max_sum_row_label = df.sum(axis=1).idxmax()
df = df.drop(max_sum_row_label)
df

KeyError: "['Column3'] not found in axis"
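
`drop` raises a `KeyError` of this "not found in axis" kind whenever the label it receives does not exist on the axis being targeted. As a sketch of step (c) on a small made-up frame `demo` with string row labels, passing the label through `index=` makes it explicit that a row, not a column, is being removed:

```python
import numpy as np
import pandas as pd

# Illustrative frame with explicit row labels (not the notebook's data)
demo = pd.DataFrame({'Column1': [1.0, 20.0, 3.0],
                     'Column2': [10.0, np.nan, 30.0]},
                    index=['r0', 'r1', 'r2'])

row_sums = demo.sum(axis=1)            # NaN is skipped: r0=11, r1=20, r2=33
max_label = row_sums.idxmax()          # label (not position) of the largest sum
trimmed = demo.drop(index=max_label)   # drop that row by label

print(max_label)
print(trimmed)
```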

In [29]:
# d. Sort the data frame based on the first column

df = df.sort_values(by='Column1')
df

Unnamed: 0,Column1,Column2,Column3,Column4
7,2.0,72.0,139.0,155.0
13,2.0,92.0,124.0,166.0
40,3.0,,136.0,153.0
3,4.0,95.0,104.0,157.0
43,7.0,96.0,103.0,162.0
18,8.0,62.0,126.0,152.0
23,8.0,64.0,112.0,169.0
6,8.0,,103.0,159.0
46,9.0,96.0,123.0,166.0
21,9.0,78.0,,165.0


In [35]:
# e. Remove all duplicates from the first column
df = df.drop_duplicates(subset='Column1')
print(len(df.index))
df

26


Unnamed: 0,Column1,Column2,Column3,Column4
7,2.0,72.0,139.0,155.0
40,3.0,,136.0,153.0
3,4.0,95.0,104.0,157.0
43,7.0,96.0,103.0,162.0
18,8.0,62.0,126.0,152.0
46,9.0,96.0,123.0,166.0
1,10.0,70.0,100.0,170.0
14,11.0,74.0,109.0,172.0
34,12.0,86.0,119.0,
9,13.0,81.0,131.0,
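
By default `drop_duplicates` keeps the first row of each repeated value. If "remove all duplicates" is read as discarding every row whose `Column1` value occurs more than once, `keep=False` does that instead. The sketch below contrasts the two readings on a made-up frame `demo`.

```python
import pandas as pd

# Illustrative frame: 2 and 8 are repeated in Column1
demo = pd.DataFrame({'Column1': [2, 2, 3, 8, 8, 8, 9],
                     'Column2': [72, 92, 40, 62, 64, 55, 96]})

# keep='first' (the default): one row survives per distinct Column1 value
first_kept = demo.drop_duplicates(subset='Column1')

# keep=False: every row whose Column1 value is repeated is removed
none_kept = demo.drop_duplicates(subset='Column1', keep=False)

print(first_kept)
print(none_kept)
```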


In [36]:
# f. Find the correlation and covariance
correlation = df['Column1'].corr(df['Column2'])
covariance = df['Column2'].cov(df['Column3'])
print("Correlation between Column1 and Column2:", correlation)
print("Covariance between Column2 and Column3:", covariance)

Correlation between Column1 and Column2: -0.27961706717231566
Covariance between Column2 and Column3: -33.567099567099575
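
Because the columns contain nulls, it helps to know that `Series.corr` and `Series.cov` use only the rows where both values are present. The sketch below, with made-up series `x` and `y`, checks this by recomputing the Pearson correlation as covariance divided by the product of the standard deviations on the complete pairs.

```python
import numpy as np
import pandas as pd

# Illustrative series with one null each, at different positions
x = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
y = pd.Series([2.0, 1.0, 4.0, np.nan, 5.0])

cov_xy = x.cov(y)     # sample covariance over rows where both values exist
corr_xy = x.corr(y)   # Pearson correlation over the same rows

# Manual check restricted to the complete pairs
pairs = pd.concat([x, y], axis=1).dropna()
manual_corr = pairs[0].cov(pairs[1]) / (pairs[0].std() * pairs[1].std())

print(cov_xy, corr_xy, manual_corr)
```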


### `pd.cut` creates equal-width bins, so the number of samples per bin can differ

### `pd.qcut` creates quantile-based bins of unequal width, so each bin holds roughly the same number of samples
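
The difference is easiest to see on skewed data. In this sketch the series `values` is made up and contains one large outlier; `labels=False` is used so that each value is mapped to an integer bin code that can be counted directly.

```python
import numpy as np
import pandas as pd

# Illustrative skewed data: nine small values and one outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Integer bin codes (0..4) instead of interval labels
width_codes = pd.cut(values, bins=5, labels=False)   # equal-width bins
freq_codes = pd.qcut(values, q=5, labels=False)      # quantile-based bins

width_counts = np.bincount(width_codes, minlength=5)
freq_counts = np.bincount(freq_codes, minlength=5)

print(width_counts)  # the outlier stretches the range, so most values share one bin
print(freq_counts)   # each quantile bin receives the same number of values here
```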

In [37]:
# g. Discretize the second column into 5 bins
df['Column2_bins'] = pd.cut(df['Column2'], bins=5)

# Display the final data frame
print("Final Data Frame:\n", df)

Final Data Frame:
     Column1  Column2  Column3  Column4    Column2_bins
7       2.0     72.0    139.0    155.0    (65.4, 73.8]
40      3.0      NaN    136.0    153.0             NaN
3       4.0     95.0    104.0    157.0    (90.6, 99.0]
43      7.0     96.0    103.0    162.0    (90.6, 99.0]
18      8.0     62.0    126.0    152.0  (56.958, 65.4]
46      9.0     96.0    123.0    166.0    (90.6, 99.0]
1      10.0     70.0    100.0    170.0    (65.4, 73.8]
14     11.0     74.0    109.0    172.0    (73.8, 82.2]
34     12.0     86.0    119.0      NaN    (82.2, 90.6]
9      13.0     81.0    131.0      NaN    (73.8, 82.2]
25     14.0     99.0    135.0    173.0    (90.6, 99.0]
47     15.0     57.0    134.0    172.0  (56.958, 65.4]
11     16.0     71.0      NaN    162.0    (65.4, 73.8]
37     17.0     68.0    130.0    161.0    (65.4, 73.8]
19     18.0     67.0    126.0    173.0    (65.4, 73.8]
35     19.0     82.0    110.0      NaN    (73.8, 82.2]
31     20.0     62.0    113.0    166.0  (56.95

In [41]:
# g (alternative). Discretize the second column into 5 quantile-based bins
df['Column2_bins'] = pd.qcut(df['Column2'], q=5)

# Display the final data frame
print("Final Data Frame:\n", df)

Final Data Frame:
     Column1  Column2  Column3  Column4    Column2_bins
7       2.0     72.0    139.0    155.0    (71.8, 76.4]
40      3.0      NaN    136.0    153.0             NaN
3       4.0     95.0    104.0    157.0    (90.2, 99.0]
43      7.0     96.0    103.0    162.0    (90.2, 99.0]
18      8.0     62.0    126.0    152.0  (56.999, 65.2]
46      9.0     96.0    123.0    166.0    (90.2, 99.0]
1      10.0     70.0    100.0    170.0    (65.2, 71.8]
14     11.0     74.0    109.0    172.0    (71.8, 76.4]
34     12.0     86.0    119.0      NaN    (76.4, 90.2]
9      13.0     81.0    131.0      NaN    (76.4, 90.2]
25     14.0     99.0    135.0    173.0    (90.2, 99.0]
47     15.0     57.0    134.0    172.0  (56.999, 65.2]
11     16.0     71.0      NaN    162.0    (65.2, 71.8]
37     17.0     68.0    130.0    161.0    (65.2, 71.8]
19     18.0     67.0    126.0    173.0    (65.2, 71.8]
35     19.0     82.0    110.0      NaN    (76.4, 90.2]
31     20.0     62.0    113.0    166.0  (56.99