# Exploratory Data Analysis using Python
Exploratory Data Analysis (EDA) is an approach to analyzing data sets that summarizes their main characteristics, often using statistical graphics and other data visualization methods. EDA is useful for:

1) Getting a better understanding of data<br>
2) Identifying various data patterns<br>
3) Getting a better understanding of the problem statement

<a href="https://pandas.pydata.org/">Pandas</a> is a popular Python library for working with tabular data (similar to the data stored in a spreadsheet). Pandas provides helper functions to read data from various file formats such as CSV, Excel spreadsheets, HTML tables, SQL, and more.

I'm using a dataset that I found on <a href="https://www.kaggle.com/iamsouravbanerjee/analytics-industry-salaries-2022-india">Kaggle</a>, in the following CSV format:

<code>Company Name,Job Title,Salaries Reported,Location,Salary
Mu Sigma,Data Scientist,105,Bangalore,"₹6,48,573/yr"
IBM,Data Scientist,95,Bangalore,"₹11,91,950/yr"
Tata Consultancy Services,Data Scientist,66,Bangalore,"₹8,36,874/yr"
Impact Analytics,Data Scientist,40,Bangalore,"₹6,69,578/yr"
Accenture,Data Scientist,32,Bangalore,"₹9,44,110/yr"
Infosys,Data Scientist,30,Bangalore,"₹9,08,764/yr"</code>

Before analysing this CSV file, we first need to clean and transform the data: identify the incorrect or irrelevant parts, then replace or delete them. To do that, let's import the pandas module and read the CSV.

# Importing the pandas & NumPy modules with the aliases pd and nm

In [1]:
import pandas as pd
import numpy as nm
pd.options.mode.chained_assignment = None  # Suppress pandas' SettingWithCopyWarning (default='warn')

# Reading a CSV file using Pandas

In [2]:
# Reading CSV file in variable data_df
data_df = pd.read_csv("Dataset.csv")

Data from the CSV file is read and stored in a DataFrame object - one of the core data structures in Pandas for storing and working with tabular data. We typically use the _df suffix in the variable names for dataframes. 
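To see what `read_csv` returns without needing the file on disk, the sample rows shown earlier can be parsed from an in-memory string. This is only a minimal sketch: `sample_csv` and `sample_df` are names made up for illustration, and only three of the sample rows are included.

```python
import io

import pandas as pd

# A few rows copied from the CSV sample shown earlier (illustrative subset)
sample_csv = '''Company Name,Job Title,Salaries Reported,Location,Salary
Mu Sigma,Data Scientist,105,Bangalore,"₹6,48,573/yr"
IBM,Data Scientist,95,Bangalore,"₹11,91,950/yr"
Tata Consultancy Services,Data Scientist,66,Bangalore,"₹8,36,874/yr"
'''

# read_csv accepts any file-like object, so StringIO stands in for a file
sample_df = pd.read_csv(io.StringIO(sample_csv))

print(type(sample_df))            # a pandas DataFrame
print(sample_df.shape)            # (rows, columns)
print(sample_df["Salary"].tolist())
```

The quoted Salary values keep their commas and the `/yr` suffix, so the column is read as text rather than as numbers.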

In [3]:
# To find out the type of variable
type(data_df)

pandas.core.frame.DataFrame

In [4]:
# Let's extract 10 random rows from the dataset using sample()
data_df.sample(10)

Unnamed: 0,Company_Name,Job_Title,Salaries_Reported,Location,Salary
2877,Unilever,Data Analyst,1.0,Mumbai,"₹53,744/yr"
1931,Publicis Groupe,Data Scientist,1.0,Mumbai,"₹6,73,381/yr"
3574,ASSAD,Data Engineer,1.0,New Delhi,"₹5,17,986/yr"
2141,MH Alshaya,Data Analyst,2.0,Bangalore,"₹7,55,874/yr"
1351,Freshers.com,Junior Data Scientist,1.0,Pune,"₹12,38,138/yr"
2219,Bajaj Allianz General Insurance Co,Data Analyst,2.0,Pune,"₹4,92,899/yr"
3085,GainInsights Solutions,Data Engineer,1.0,Bangalore,"₹10,00,000/yr"
3924,Kruzr,Machine Learning Engineer,1.0,Bangalore,"₹20,17,951/yr"
2255,First Student,Data Analyst,2.0,Pune,"₹2,77,519/yr"
516,GoodWorkLabs Services,Data Scientist,1.0,Bangalore,"₹11,24,063/yr"
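Because `sample()` picks rows at random, each run shows different rows. If a repeatable sample is wanted, a seed can be passed through the `random_state` parameter. The sketch below uses a small made-up frame (`demo_df`) rather than the real dataset.

```python
import pandas as pd

# Small made-up frame, used only to demonstrate reproducible sampling
demo_df = pd.DataFrame({
    "Company_Name": ["A", "B", "C", "D", "E"],
    "Salaries_Reported": [1, 2, 3, 4, 5],
})

# The same seed returns the same rows on every call
first = demo_df.sample(3, random_state=42)
second = demo_df.sample(3, random_state=42)

print(first.equals(second))
```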


The data looks good overall, but the Salary column mixes hourly, monthly, and yearly figures, and we need yearly salaries for the analysis. Let's clean and transform the data using pandas functions.
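One way to check how the pay periods are distributed before transforming anything is to pull out the suffix after the slash and count it. This is a hedged sketch on a handful of values: the series `salaries` is illustrative, and the `/mo` and `/hr` entries are invented examples of the format rather than rows from the dataset.

```python
import pandas as pd

# Illustrative values; the '/mo' and '/hr' entries are invented for the demo
salaries = pd.Series(["₹6,48,573/yr", "₹5,180/mo", "₹350/hr", "₹11,91,950/yr"])

# Capture the letters after the final '/' (yr, mo or hr)
units = salaries.str.extract(r"/(\w+)$", expand=False)

# Count how many salaries use each pay period
print(units.value_counts())
```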

# Transforming Data using Pandas

In [5]:
# Creating temp column 'wise' to store 'yr', 'mo', 'hr' from salary
data_df['wise'] = data_df['Salary'].str.slice(-2)

# Let's strip the leading currency symbol and the trailing '/yr', '/mo' or '/hr',
# then remove the thousands separators, using the .str accessor
data_df.Salary = data_df.Salary.str.slice(1,-3)
data_df.Salary = data_df.Salary.str.replace(',','', regex=False)

# Converting the Salary column to float
data_df.Salary = data_df.Salary.astype(float)

data_df

Unnamed: 0,Company_Name,Job_Title,Salaries_Reported,Location,Salary,wise
0,Mu Sigma,Data Scientist,105.0,Bangalore,648573.0,yr
1,IBM,Data Scientist,95.0,Bangalore,1191950.0,yr
2,Tata Consultancy Services,Data Scientist,66.0,Bangalore,836874.0,yr
3,Impact Analytics,Data Scientist,40.0,Bangalore,669578.0,yr
4,Accenture,Data Scientist,32.0,Bangalore,944110.0,yr
...,...,...,...,...,...,...
4338,TaiyōAI,Machine Learning Scientist,1.0,Mumbai,5180.0,mo
4339,Decimal Point Analytics,Machine Learning Developer,1.0,Mumbai,751286.0,yr
4340,MyWays,Machine Learning Developer,1.0,Mumbai,410952.0,yr
4341,Market Pulse Technologies,Software Engineer - Machine Learning,1.0,Mumbai,1612324.0,yr


In [6]:
# Convert monthly salary to yearly salary where wise is equal to 'mo'
data_df.loc[data_df.wise == 'mo', 'Salary'] *= 12

# Convert hourly salary to yearly salary where wise is equal to 'hr'
# (8 hours/day * 5 days/week * 4 weeks/month * 12 months/year)
data_df.loc[data_df.wise == 'hr', 'Salary'] *= 8 * 5 * 4 * 12

# Create a new column 'Yearly_Salary' to store the yearly salaries
data_df['Yearly_Salary'] = data_df.Salary

# Drop the original Salary column and the temporary 'wise' column, which are no longer needed
data_df = data_df.drop(['Salary', 'wise'], axis=1)

# Now again extract 10 random rows from the dataset for rechecking
data_df.sample(10)

Unnamed: 0,Company_Name,Job_Title,Salaries_Reported,Location,Yearly_Salary
2592,ABC,Data Analyst,6.0,New Delhi,538846.0
4129,Phenom,Machine Learning Engineer,1.0,Hyderabad,187728.0
3636,GEP,Data Engineer,2.0,Mumbai,1321470.0
3498,Freelancer,Data Engineer,2.0,New Delhi,1271011.0
2974,Brillio,Data Engineer,6.0,Bangalore,514393.0
3755,Infosys,Machine Learning Engineer,2.0,Bangalore,832266.0
3515,Innovaccer,Data Engineer,1.0,New Delhi,443306.0
2478,Goldman Sachs,Data Analyst,1.0,Hyderabad,455550.0
1913,Angel One,Data Scientist,1.0,Mumbai,1513463.0
2060,EMPLOYERS,Data Analyst,4.0,Bangalore,316821.0


Now the data looks good, as every salary is expressed as a yearly figure.
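The same transformation can also be packaged as one reusable function. The sketch below is an alternative to the slicing approach used above, not the method from the notebook: it parses the amount and the pay period with a regular expression. The names `to_yearly`, `PERIODS_PER_YEAR`, and `demo` are hypothetical, the `/mo` and `/hr` values are invented, and the hourly multiplier simply reuses the `8*5*4*12` factor from the cell above.

```python
import pandas as pd

# Multipliers per pay period; the hourly factor mirrors 8*5*4*12 used above
PERIODS_PER_YEAR = {"yr": 1, "mo": 12, "hr": 8 * 5 * 4 * 12}


def to_yearly(salary: pd.Series) -> pd.Series:
    """Convert strings such as '₹6,48,573/yr' to yearly amounts as floats."""
    # Skip any leading non-digits (the currency symbol), then capture
    # the digits with commas and the unit after the slash
    parts = salary.str.extract(r"^\D*(?P<amount>[\d,]+)/(?P<unit>\w+)$")
    amount = parts["amount"].str.replace(",", "", regex=False).astype(float)
    return amount * parts["unit"].map(PERIODS_PER_YEAR)


# Illustrative input; the '/mo' and '/hr' values are invented
demo = pd.Series(["₹6,48,573/yr", "₹5,180/mo", "₹350/hr"])
print(to_yearly(demo).tolist())  # [648573.0, 62160.0, 672000.0]
```

A string that does not match the pattern produces NaN, which makes unexpected formats easy to spot afterwards with `isna()`.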

In [7]:
# Let's get a quick summary of the dataset using info()
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4343 entries, 0 to 4342
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Company_Name       4340 non-null   object 
 1   Job_Title          4343 non-null   object 
 2   Salaries_Reported  4341 non-null   float64
 3   Location           4343 non-null   object 
 4   Yearly_Salary      4343 non-null   float64
dtypes: float64(2), object(3)
memory usage: 169.8+ KB


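The info() function prints a concise summary of the dataframe: the index range, each column's name, non-null count and dtype, and the memory usage. Here it shows that Company_Name (4340 of 4343) and Salaries_Reported (4341 of 4343) have a few missing values.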
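The non-null counts reported by info() can be turned into explicit missing-value counts with `isna().sum()`. The sketch below uses a small made-up frame (`demo_df`) with missing entries placed by hand, so the numbers are not those of the real dataset.

```python
import pandas as pd

# Made-up frame with one missing value in each of two columns
demo_df = pd.DataFrame({
    "Company_Name": ["Mu Sigma", None, "IBM"],
    "Salaries_Reported": [105.0, 95.0, None],
    "Yearly_Salary": [648573.0, 1191950.0, 836874.0],
})

# isna() marks missing cells as True; sum() counts them per column
missing = demo_df.isna().sum()
print(missing)
```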

In [8]:
# Let's generate descriptive statistics of the data using describe()
data_df.describe()

Unnamed: 0,Salaries_Reported,Yearly_Salary
count,4341.0,4343.0
mean,2.776319,915111.9
std,5.14705,857526.8
min,1.0,10814.0
25%,1.0,417179.0
50%,1.0,703418.0
75%,3.0,1184878.0
max,105.0,18807950.0


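describe() generates descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for a Pandas DataFrame or Series, giving a quick overview of the dataset. By default it covers only the numeric columns of a DataFrame, which is why only Salaries_Reported and Yearly_Salary appear above.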
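describe() can also summarize text columns. When the selected columns contain no numeric data, it reports the count, the number of unique values, the most frequent value, and its frequency instead. A minimal sketch on a made-up frame (`demo_df`, with values picked for illustration):

```python
import pandas as pd

# Made-up frame with two text columns and one numeric column
demo_df = pd.DataFrame({
    "Job_Title": ["Data Scientist", "Data Scientist", "Data Analyst"],
    "Location": ["Bangalore", "Pune", "Bangalore"],
    "Yearly_Salary": [648573.0, 1191950.0, 538846.0],
})

# Selecting only the text columns yields count, unique, top and freq
text_summary = demo_df[["Job_Title", "Location"]].describe()
print(text_summary)
```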