# Pandas Practical Learning

About: **Pandas** is a Python library that provides extensive tools for data analysis. Most of the time, the data to be analyzed is stored in tabular formats such as .csv, .tsv, or .xlsx.

- Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries.
- It has functions for analyzing, cleaning, exploring, and manipulating data.
- In conjunction with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of tabular data.

Data structures in Pandas: 
- Series: a one-dimensional indexed array holding values of a single data type
- DataFrame: a two-dimensional data structure (a table) in which each column holds data of a single type
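
The two structures can be built by hand to see how they relate. The following is a minimal sketch, assuming a hypothetical three-row frame named `toy` whose values are borrowed from the first rows of the dataset used in this notebook:

```python
import pandas as pd

# A Series: one-dimensional, labelled by an index, with a single dtype.
minutes = pd.Series(
    [265.1, 161.6, 243.4],
    index=["KS", "OH", "NJ"],
    name="Total day minutes",
)

# A DataFrame: a table in which every column is a Series.
# `toy` is a hypothetical stand-in, not the real dataset.
toy = pd.DataFrame(
    {
        "State": ["KS", "OH", "NJ"],
        "Total day minutes": [265.1, 161.6, 243.4],
        "Churn": [False, False, False],
    }
)

print(minutes.dtype)        # dtype shared by all values of the Series
print(toy.dtypes)           # one dtype per column
print(type(toy["State"]))   # selecting a single column gives back a Series
```

Each column of a DataFrame keeps its own dtype, which is why a table can mix text, numbers, and booleans.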

In [2]:
import numpy as np
import pandas as pd

In [3]:
# Read the data stored in CSV format using read_csv
df = pd.read_csv("telecom_churn.csv")
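
If `telecom_churn.csv` is not at hand, the same call can be tried on an in-memory file. This is only a sketch: `csv_text` is a hypothetical three-row excerpt with a reduced set of columns, and `sample` is an illustrative name, not part of the notebook.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for telecom_churn.csv (fewer columns, three rows).
csv_text = """State,Account length,International plan,Total day minutes,Churn
KS,128,No,265.1,False
OH,107,No,161.6,False
NJ,137,No,243.4,False
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))
print(sample.dtypes)
```

`read_csv` infers a dtype for each column, so the `Churn` column is read as booleans while `International plan` stays as text.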

The first step in data analysis is to get familiar with the structure of the data, i.e. what kinds of data are available to us. Since most of the data sets we analyze are very large, looking at the first few entries is usually enough. To look at the first 5 entries we use the **.head()** method.

In [4]:
df.head()

| | State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | No | Yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | No | Yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | No | No | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | Yes | No | 0 | 299.4 | 71 | 50.9 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | Yes | No | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


Observations: Each row corresponds to one client (an instance), and the columns are the features of that instance.

Before performing any operations we should know the shape of the data (i.e. the number of rows and columns). For this we use the **.shape** attribute, which returns a tuple with two values:

- the first is the number of rows (no. of examples)

- the second is the number of columns (no. of features)

In [5]:
print(df.shape)

(3333, 20)


Now let's try printing out column names using **.columns**.

In [6]:
print(df.columns)

Index(['State', 'Account length', 'Area code', 'International plan',
       'Voice mail plan', 'Number vmail messages', 'Total day minutes',
       'Total day calls', 'Total day charge', 'Total eve minutes',
       'Total eve calls', 'Total eve charge', 'Total night minutes',
       'Total night calls', 'Total night charge', 'Total intl minutes',
       'Total intl calls', 'Total intl charge', 'Customer service calls',
       'Churn'],
      dtype='object')


## Viewing general info
To get some general information about the DataFrame we can use **.info()**, which helps us understand the data better: for example, the *data types* of the feature values and whether any *null values* (i.e. missing values) are present.

In [7]:
# info() prints its report directly and returns None, so wrapping it in print() is unnecessary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   State                   3333 non-null   object 
 1   Account length          3333 non-null   int64  
 2   Area code               3333 non-null   int64  
 3   International plan      3333 non-null   object 
 4   Voice mail plan         3333 non-null   object 
 5   Number vmail messages   3333 non-null   int64  
 6   Total day minutes       3333 non-null   float64
 7   Total day calls         3333 non-null   int64  
 8   Total day charge        3333 non-null   float64
 9   Total eve minutes       3333 non-null   float64
 10  Total eve calls         3333 non-null   int64  
 11  Total eve charge        3333 non-null   float64
 12  Total night minutes     3333 non-null   float64
 13  Total night calls       3333 non-null   int64  
 14  Total night charge      3333 non-null   

### Observations:

- bool, int64, float64 and object are the data types of our features. 
- one feature is logical (bool), 3 features are of type object, and 16 features are numeric
- No missing values because each column contains 3333 observations, the same number of rows we saw before with shape.
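
The same two checks (missing values and the mix of data types) can also be made directly. Below is a minimal sketch on a hypothetical frame `toy` with one deliberately missing value; the real dataset has none.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value.
toy = pd.DataFrame(
    {
        "State": ["KS", "OH", "NJ"],
        "Total day minutes": [265.1, np.nan, 243.4],
        "Churn": [False, False, True],
    }
)

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of columns of each dtype.
print(toy.dtypes.value_counts())
```

A column whose entry in `missing_per_column` is greater than zero would show fewer non-null values than rows in the `.info()` report.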

## Getting the stats about the data:

Knowing some statistics about the data always helps the analysis. To get them we can use the **describe()** method of pandas. Let's see how that works:

In [8]:
df.describe()

| | Account length | Area code | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 437.182418 | 8.09901 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |
| std | 39.822106 | 42.37129 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 |
| min | 1.0 | 408.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 74.0 | 408.0 | 0.0 | 143.7 | 87.0 | 24.43 | 166.6 | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 |
| 50% | 101.0 | 415.0 | 0.0 | 179.4 | 101.0 | 30.5 | 201.4 | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 |
| 75% | 127.0 | 510.0 | 20.0 | 216.4 | 114.0 | 36.79 | 235.3 | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 |
| max | 243.0 | 510.0 | 51.0 | 350.8 | 165.0 | 59.64 | 363.7 | 170.0 | 30.91 | 395.0 | 175.0 | 17.77 | 20.0 | 20.0 | 5.4 | 9.0 |


### Observations:

The describe method shows basic statistical characteristics of each numerical feature (int64 and float64 types): number of non-missing values, mean, standard deviation, minimum and maximum, median (50%), and the first and third quartiles (25% and 75%).

**NOTE:** Notice that describe didn't report any statistics for the **Churn** feature. The reason is that, by default, **.describe()** only summarizes numerical columns, since most of these statistics are not meaningful for bool values.

To see statistics on non-numerical features (such as **bool values**), one has to explicitly list the data types of interest in the include parameter.

In [9]:
df.describe(include=["object", "bool"])

| | State | International plan | Voice mail plan | Churn |
|---|---|---|---|---|
| count | 3333 | 3333 | 3333 | 3333 |
| unique | 51 | 2 | 2 | 2 |
| top | WV | No | No | False |
| freq | 106 | 3010 | 2411 | 2850 |


### Observations:

- count: number of non-missing values of the feature
- unique: number of unique values of the feature
- top: the most frequently occurring value
- freq: frequency of the most frequently occurring value
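
These four rows can be checked by hand on a small column. Here is a minimal sketch, assuming a hypothetical four-value Series named `plans`:

```python
import pandas as pd

# Hypothetical toy column of plan flags: three "No" and one "Yes".
plans = pd.Series(["No", "Yes", "No", "No"], name="International plan")

# For a non-numeric column, describe() reports count, unique, top and freq.
summary = plans.describe()
print(summary)

# freq is simply the largest entry of value_counts().
largest_count = int(plans.value_counts().iloc[0])
print(largest_count)
```

Here `top` is the value that appears most often and `freq` is how many times it appears.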

To see the count of each value in the feature, we can use .value_counts().

Let's have a look at the distribution of Churn:

In [10]:
df["Churn"].value_counts()

False    2850
True      483
Name: Churn, dtype: int64
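
Raw counts are often easier to read as fractions. As a hedged sketch, the `normalize=True` option of `value_counts` returns shares instead of counts. The Series `churn` below is rebuilt from the two counts printed above rather than loaded from the file:

```python
import pandas as pd

# Rebuild a Churn column from the counts above: 2850 False and 483 True.
churn = pd.Series([False] * 2850 + [True] * 483, name="Churn")

# normalize=True returns fractions that sum to 1 instead of raw counts.
shares = churn.value_counts(normalize=True)
print(shares.round(3))
```

With these counts, about 85.5% of customers stayed and about 14.5% churned.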

## Performing various operations on pandas DataFrames

#### Sorting:

A DataFrame can be sorted by the values of one of its variables/features (i.e. columns). For example, we can sort by Total day charge (use ascending=False to sort in descending order):

In [11]:
result = df.sort_values(by="Total day charge", ascending=False).head()

We can also sort by multiple columns:

In [18]:
# Sort by Churn in ascending order (False before True), then by Total day minutes in descending order,
# so the first row is the non-churned customer with the most total day minutes
df.sort_values(by=["Churn", "Total day minutes"], ascending=[True, False]).head()

| | State | Account length | Area code | International plan | Voice mail plan | Number vmail messages | Total day minutes | Total day calls | Total day charge | Total eve minutes | Total eve calls | Total eve charge | Total night minutes | Total night calls | Total night charge | Total intl minutes | Total intl calls | Total intl charge | Customer service calls | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 688 | MN | 13 | 510 | No | Yes | 21 | 315.6 | 105 | 53.65 | 208.9 | 71 | 17.76 | 260.1 | 123 | 11.7 | 12.1 | 3 | 3.27 | 3 | False |
| 2259 | NC | 210 | 415 | No | Yes | 31 | 313.8 | 87 | 53.35 | 147.7 | 103 | 12.55 | 192.7 | 97 | 8.67 | 10.1 | 7 | 2.73 | 3 | False |
| 534 | LA | 67 | 510 | No | No | 0 | 310.4 | 97 | 52.77 | 66.5 | 123 | 5.65 | 246.5 | 99 | 11.09 | 9.2 | 10 | 2.48 | 4 | False |
| 575 | SD | 114 | 415 | No | Yes | 36 | 309.9 | 90 | 52.68 | 200.3 | 89 | 17.03 | 183.5 | 105 | 8.26 | 14.2 | 2 | 3.83 | 1 | False |
| 2858 | AL | 141 | 510 | No | Yes | 28 | 308.0 | 123 | 52.36 | 247.8 | 128 | 21.06 | 152.9 | 103 | 6.88 | 7.4 | 3 | 2.0 | 1 | False |
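
Because the first five rows above are all non-churned customers, the effect of the second sort key is easier to see on a small frame that contains both classes. This is a minimal sketch on hypothetical data: the `Churn` values in `toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the Churn values are invented for this illustration.
toy = pd.DataFrame(
    {
        "State": ["KS", "OH", "NJ", "OK"],
        "Total day minutes": [265.1, 161.6, 243.4, 166.7],
        "Churn": [False, True, False, True],
    }
)

# Churn ascending (False first), then Total day minutes descending within each group.
ordered = toy.sort_values(
    by=["Churn", "Total day minutes"], ascending=[True, False]
)
print(ordered)
```

The rows come out grouped by `Churn`, and inside each group the customer with the most day minutes comes first. The original index labels are kept, which is why the index of a sorted frame is no longer in order.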


## Indexing and retrieving data:

When we want to work with a specific part of the data, retrieving it becomes crucial. Pandas provides efficient ways to retrieve tabular data. The most commonly used constructions are:

#### DataFrame['Name'] construction - To get all the values in a single column

For example:

1. Let's say we want to know which unique states are present in the dataset, **or**
2. we want to calculate the average total day minutes.

In [21]:
# unique states
df['State'].unique()

array(['KS', 'OH', 'NJ', 'OK', 'AL', 'MA', 'MO', 'LA', 'WV', 'IN', 'RI',
       'IA', 'MT', 'NY', 'ID', 'VT', 'VA', 'TX', 'FL', 'CO', 'AZ', 'SC',
       'NE', 'WY', 'HI', 'IL', 'NH', 'GA', 'AK', 'MD', 'AR', 'WI', 'OR',
       'MI', 'DE', 'UT', 'CA', 'MN', 'SD', 'NC', 'WA', 'NM', 'NV', 'DC',
       'KY', 'ME', 'MS', 'TN', 'PA', 'CT', 'ND'], dtype=object)

In [23]:
#mean 
df['Total day minutes'].mean()

179.77509750975116
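
A detail worth knowing about this construction is that a single column name returns a Series, while a list of names returns a DataFrame. Below is a minimal sketch on a hypothetical four-row frame `toy`:

```python
import pandas as pd

# Hypothetical toy frame using two of the dataset's column names.
toy = pd.DataFrame(
    {
        "State": ["KS", "OH", "NJ", "OH"],
        "Total day minutes": [265.1, 161.6, 243.4, 299.4],
    }
)

as_series = toy["Total day minutes"]    # single name: a Series
as_frame = toy[["Total day minutes"]]   # list of names: a one-column DataFrame

print(type(as_series).__name__, type(as_frame).__name__)

n_states = toy["State"].nunique()       # number of distinct states
mean_minutes = as_series.mean()         # average of the column
print(n_states, mean_minutes)
```

Methods such as `unique()`, `nunique()`, and `mean()` are then called on the selected Series.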

#### Boolean indexing with one column - To get data based on some condition
  