In [1]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv("Churn_Modelling.csv")

In [4]:
df.shape

(10000, 14)

In [5]:
df.columns

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

### 1. Dropping columns

Setting the axis parameter to 1 drops columns; 0 (the default) drops rows. Setting the inplace parameter to True modifies the dataframe directly instead of returning a new one.

In [6]:
df.drop(['RowNumber', 'CustomerId', 'Surname', 'CreditScore'], axis=1, inplace=True)

In [7]:
df.shape

(10000, 10)
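
As a minimal sketch of the same operation without inplace, the example below uses the `columns=` keyword, which is equivalent to passing the labels with axis=1. The `toy` dataframe and its values are made up for illustration; only the column names echo the dataset.

```python
import pandas as pd

# Small stand-in frame; values are invented, column names mirror the churn dataset
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "Surname": ["Hill", "Onio", "Chu"],
    "Age": [42, 41, 39],
    "Exited": [1, 0, 0],
})

# columns=[...] is equivalent to passing the labels together with axis=1
trimmed = toy.drop(columns=["RowNumber", "Surname"])

print(trimmed.shape)  # (3, 2)
print(toy.shape)      # (3, 4): without inplace=True the original is untouched
```

Returning a new dataframe keeps the original available, which can be convenient when a notebook cell is re-run.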

### 2. Select particular columns while reading

We can read only a subset of the columns from the CSV file by passing a list of column names to the **usecols** parameter.

In [8]:
df_spec = pd.read_csv("Churn_Modelling.csv", usecols=['Gender', 'Age', 'Tenure', 'Balance'])

In [9]:
df_spec.head()

   Gender  Age  Tenure    Balance
0  Female   42       2       0.00
1  Female   41       1   83807.86
2  Female   42       8  159660.80
3  Female   39       1       0.00
4  Female   43       2  125510.82
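
The sketch below shows usecols on an in-memory CSV so that it runs without the data file. The `csv_text` contents are a made-up excerpt, not the real file. It also shows that usecols accepts a callable, which is handy when it is easier to name the columns to exclude.

```python
import io
import pandas as pd

# Made-up excerpt standing in for the CSV file
csv_text = """RowNumber,Gender,Age,Tenure,Balance,Exited
1,Female,42,2,0.0,1
2,Female,41,1,83807.86,0
3,Male,39,1,0.0,0
"""

# Only the listed columns are parsed; they come back in file order
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Gender", "Age", "Balance"])
print(subset.columns.tolist())  # ['Gender', 'Age', 'Balance']

# A callable is evaluated against each column name; True means "keep"
no_row_number = pd.read_csv(
    io.StringIO(csv_text), usecols=lambda name: name != "RowNumber"
)
print(no_row_number.shape)  # (3, 5)
```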


### 3. Reading a part of the dataframe

The read_csv function can also read only part of the rows in a file. There are two options. The first one is to read the first n rows with the nrows parameter.

The second is to skip rows at the start of the file with the skiprows parameter, which lets us read the rows toward the end. Note that an integer such as skiprows=5000 skips the first 5000 lines of the file, including the header line. Passing skiprows=range(1, 5001) instead skips the first 5000 data rows and keeps the header.

In [10]:
df_partial = pd.read_csv("Churn_Modelling.csv", nrows=5000)

In [11]:
df_partial.shape

(5000, 14)
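
A small sketch of both options on an in-memory CSV (the `csv_text` data is invented for illustration). It also shows what happens to the header when skiprows is given as a plain integer.

```python
import io
import pandas as pd

# One header line followed by ten data rows (ids 1..10)
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(1, 11))

# nrows counts data rows, not the header
first_four = pd.read_csv(io.StringIO(csv_text), nrows=4)
print(first_four["id"].tolist())  # [1, 2, 3, 4]

# range(1, 7) skips file lines 1..6 (the first six data rows) and keeps line 0, the header
last_four = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 7))
print(last_four["id"].tolist())  # [7, 8, 9, 10]

# An integer skips that many lines from the top, header included,
# so the first remaining line ("6,60") is treated as the header
headerless = pd.read_csv(io.StringIO(csv_text), skiprows=6)
print(headerless.columns.tolist())  # ['6', '60']
```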

### 4. Sample

After creating a dataframe, we may want to draw a small random sample to work with. We can use either the n parameter or the frac parameter to determine the sample size.

* n: The number of rows in the sample
* frac: The ratio of the sample size to the whole dataframe size

In [14]:
df_sample = df.sample(n=1000)
df_sample.shape

(1000, 10)

In [15]:
df_sample2 = df.sample(frac=0.1)
df_sample2.shape

(1000, 10)
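
Because sample draws rows at random, two runs generally return different rows. The sketch below, on a made-up `toy` dataframe, uses the random_state parameter to make the draw repeatable.

```python
import pandas as pd

# Invented 100-row frame used only for this illustration
toy = pd.DataFrame({"Age": range(20, 120)})

# The same seed yields the same rows on every call
a = toy.sample(n=10, random_state=42)
b = toy.sample(n=10, random_state=42)
print(a.index.equals(b.index))  # True

# frac=0.1 of 100 rows also gives 10 rows
c = toy.sample(frac=0.1, random_state=42)
print(len(c))  # 10
```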

### 5. Checking the missing values

The **isna** function returns a boolean mask that marks the missing values in a dataframe. Chaining **sum** counts the missing values in each column.

In [16]:
df.isna().sum()

Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
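
The dataset has no missing values at this point, so the following sketch uses a small made-up dataframe to show what the mask looks like when some are present, along with two common follow-ups: the share of missing values per column and the rows that contain at least one.

```python
import numpy as np
import pandas as pd

# Invented frame with a few gaps
toy = pd.DataFrame({
    "Geography": ["France", None, "Spain", "Germany"],
    "Balance": [0.0, 83807.86, np.nan, np.nan],
    "Exited": [1, 0, 0, 1],
})

mask = toy.isna()        # boolean frame, True where a value is missing
per_column = mask.sum()  # number of missing values per column
share = mask.mean()      # fraction of missing values per column

print(int(per_column["Balance"]))  # 2
print(float(share["Balance"]))     # 0.5

# Rows with at least one missing value
rows_with_gaps = toy[mask.any(axis=1)]
print(len(rows_with_gaps))  # 3
```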

### 6. Select rows and columns based on position (iloc) or label (loc)

In [17]:
missing_index = np.random.randint(10000, size=20)
df.loc[missing_index, ['Balance','Geography']] = np.nan

In [18]:
df.iloc[missing_index, -1] = np.nan

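“-1” is the position of the last column, which is “Exited”. Together, these two cells set randomly chosen rows to NaN in three columns so that we have missing values to work with.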

In [19]:
df.isna().sum()

Geography          20
Gender              0
Age                 0
Tenure              0
Balance            20
NumOfProducts       0
HasCrCard           0
IsActiveMember      0
EstimatedSalary     0
Exited             20
dtype: int64
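
In the cells above, loc and iloc pick the same rows because the dataframe still has its default index, where labels and positions coincide. The sketch below uses a made-up dataframe with a shuffled index to show how the two differ once that is no longer the case (for example after sampling or filtering).

```python
import pandas as pd

# Invented frame; the index labels are deliberately out of order
toy = pd.DataFrame(
    {"Balance": [10.0, 20.0, 30.0], "Exited": [0, 1, 0]},
    index=[2, 0, 1],
)

print(toy.loc[0, "Balance"])  # 20.0 (the row whose label is 0)
print(toy.iloc[0, 0])         # 10.0 (the first row by position)
print(toy.iloc[0, -1])        # 0 (first row, last column)
```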

### 7. Filling missing values

In [21]:
df['Geography'].value_counts()

France     5003
Germany    2504
Spain      2473
Name: Geography, dtype: int64

In [22]:
mode = df['Geography'].value_counts().index[0]
print(mode)

France


In [23]:
# Assign the result back; newer pandas versions warn about inplace=True on a selected column
df['Geography'] = df['Geography'].fillna(value=mode)

In [24]:
df['Geography'].value_counts()

France     5023
Germany    2504
Spain      2473
Name: Geography, dtype: int64

In [25]:
avg = df['Balance'].mean()  # mean of the non-missing balances
df['Balance'] = df['Balance'].fillna(value=avg)
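
The two fills above can also be done in a single call, since fillna on a dataframe accepts a dictionary that maps column names to fill values. This is a sketch on a made-up `toy` dataframe rather than the notebook's data; it uses `Series.mode()` as an alternative way to get the most frequent value.

```python
import numpy as np
import pandas as pd

# Invented frame with one gap in each column
toy = pd.DataFrame({
    "Geography": ["France", np.nan, "France", "Spain"],
    "Balance": [100.0, 200.0, np.nan, 300.0],
})

fill_values = {
    "Geography": toy["Geography"].mode()[0],  # most frequent value
    "Balance": toy["Balance"].mean(),         # mean of the non-missing values
}
filled = toy.fillna(value=fill_values)

print(filled.loc[1, "Geography"])  # France
print(filled.loc[2, "Balance"])    # 200.0
```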

### 8. Dropping missing values

In [26]:
df.dropna(axis=0, how='any', inplace=True)

With **axis=0** and **how='any'**, a row is dropped if it contains at least one missing value. Setting **axis=1** drops columns with missing values instead. We can also set a threshold for the number of non-missing values that a row or column must have. For instance, **thresh=5** means that a row must have at least 5 non-missing values to be kept. The rows that have 4 or fewer non-missing values will be dropped.

In [27]:
df.isna().sum().sum()

0
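
To make the thresh and axis=1 behaviour concrete, here is a minimal sketch on a made-up three-row dataframe.

```python
import numpy as np
import pandas as pd

# Invented frame: row 0 has 3 non-missing values, row 1 has 2, row 2 has 1
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, 9.0],
})

# Keep only rows with at least 2 non-missing values
kept = toy.dropna(thresh=2)
print(kept.index.tolist())  # [0, 1]

# axis=1 with how='any' drops every column that contains a missing value
complete_cols = toy.dropna(axis=1, how="any")
print(complete_cols.columns.tolist())  # ['c']
```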