# Exploratory Data Analysis using Python
Exploratory Data Analysis (EDA) is an approach to analyzing data sets that summarizes their main characteristics, often using statistical graphics and other data visualization methods. EDA is very useful for:

1) Getting a better understanding of data<br>
2) Identifying various data patterns<br>
3) Getting a better understanding of the problem statement

<a href="https://pandas.pydata.org/">Pandas</a> is a popular Python library for working with tabular data (similar to the data stored in a spreadsheet). Pandas provides helper functions to read data from various file formats such as CSV, Excel spreadsheets, HTML tables, SQL, and more.

I'm using a dataset that I found on <a href="https://www.kaggle.com/iamsouravbanerjee/analytics-industry-salaries-2022-india">Kaggle</a>, in the following CSV format:

<code>Company Name,Job Title,Salaries Reported,Location,Salary
Mu Sigma,Data Scientist,105,Bangalore,"₹6,48,573/yr"
IBM,Data Scientist,95,Bangalore,"₹11,91,950/yr"
Tata Consultancy Services,Data Scientist,66,Bangalore,"₹8,36,874/yr"
Impact Analytics,Data Scientist,40,Bangalore,"₹6,69,578/yr"
Accenture,Data Scientist,32,Bangalore,"₹9,44,110/yr"
Infosys,Data Scientist,30,Bangalore,"₹9,08,764/yr"</code>
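
As a minimal sketch of how pandas parses this format, the first few rows shown above can be read from an in-memory string. The names `csv_text` and `sample_df` are made up for this illustration, and `io.StringIO` stands in for the real file so the snippet runs without Dataset.csv. Note that the quoted Salary values keep their commas and are read as text, not as numbers, which is why a cleaning step is needed.

```python
import io

import pandas as pd

# First rows of the CSV shown above, held in memory (illustrative stand-in for the real file).
csv_text = '''Company Name,Job Title,Salaries Reported,Location,Salary
Mu Sigma,Data Scientist,105,Bangalore,"₹6,48,573/yr"
IBM,Data Scientist,95,Bangalore,"₹11,91,950/yr"
Tata Consultancy Services,Data Scientist,66,Bangalore,"₹8,36,874/yr"
'''

# read_csv accepts any file-like object, not only a path.
sample_df = pd.read_csv(io.StringIO(csv_text))

# The quoted Salary field is kept as a single text value, commas included.
print(sample_df.loc[0, "Salary"])
print(sample_df.dtypes)
```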

Before analysing this CSV file, we first need to clean the data: identify the incorrect or irrelevant parts and then replace or delete them. For instance, the ₹ symbol and the /yr suffix should be removed from the Salary column. To do that, let's import the pandas module and read the CSV.

# Importing the pandas and NumPy modules with the aliases pd and np

In [1]:
import pandas as pd
import numpy as np

# Reading a CSV file using Pandas

In [26]:
# Reading CSV file in variable data_df
data_df = pd.read_csv("Dataset.csv")

In [3]:
# To find out the type of variable
type(data_df)

pandas.core.frame.DataFrame

In [4]:
# Let's extract some random rows from the dataset using sample()
data_df.sample(10)

Unnamed: 0,Company_Name,Job_Title,Salaries_Reported,Location,Salary
1529,Don't Know,Data Scientist,1.0,Hyderabad,"₹25,122/mo"
3356,BA Continuum India,Data Engineer,1.0,Hyderabad,"₹8,21,904/yr"
3507,Cognizant Technology Solutions,Data Engineer,2.0,New Delhi,"₹9,09,909/yr"
2309,Amadeus,Data Analyst,1.0,Pune,"₹40,195/mo"
1526,Electronic Arts,Data Scientist,1.0,Hyderabad,"₹9,39,649/yr"
311,Hewlett-Packard,Data Scientist,1.0,Bangalore,"₹22,04,142/yr"
3626,Mactores,Data Engineer,5.0,Mumbai,"₹5,42,342/yr"
3376,Tech Data,Data Engineer,1.0,Hyderabad,"₹51,799/mo"
2862,AIG,Data Analyst,2.0,Mumbai,"₹11,87,386/yr"
751,Hudl,Data Scientist,1.0,Bangalore,"₹3,00,978/yr"
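
Because sample() picks rows at random, each run shows different rows. If a reproducible sample is wanted, the `random_state` parameter of `DataFrame.sample` can be fixed. The following is a small sketch on a made-up toy frame (`toy_df` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame used only for illustration.
toy_df = pd.DataFrame({
    "Company_Name": list("ABCDEFGH"),
    "Salaries_Reported": range(8),
})

# With the same random_state, sample() returns the same rows every time.
first = toy_df.sample(3, random_state=42)
second = toy_df.sample(3, random_state=42)

print(first.equals(second))  # True
```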


The data looks good overall, but the Salary column mixes hourly, monthly and yearly figures, and the analysis needs every salary on a yearly basis. Let's clean and transform the data using pandas functions.

# Transforming Data using Pandas

In [27]:
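# Let's strip the unwanted characters and make the Salary column numeric using the .str accessor
data_df['wise'] = data_df['Salary'].str.slice(-2)      # pay period: 'yr', 'mo' or 'hr'
data_df.Salary = data_df.Salary.str.slice(1, -3)       # drop the leading ₹ and the trailing '/yr', '/mo' or '/hr'
data_df.Salary = data_df.Salary.str.replace(',', '')   # remove the digit-group commas
data_df.Salary = data_df.Salary.astype(float)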

data_df

Unnamed: 0,Company_Name,Job_Title,Salaries_Reported,Location,Salary,wise
0,Mu Sigma,Data Scientist,105.0,Bangalore,648573.0,yr
1,IBM,Data Scientist,95.0,Bangalore,1191950.0,yr
2,Tata Consultancy Services,Data Scientist,66.0,Bangalore,836874.0,yr
3,Impact Analytics,Data Scientist,40.0,Bangalore,669578.0,yr
4,Accenture,Data Scientist,32.0,Bangalore,944110.0,yr
...,...,...,...,...,...,...
4338,TaiyōAI,Machine Learning Scientist,1.0,Mumbai,5180.0,mo
4339,Decimal Point Analytics,Machine Learning Developer,1.0,Mumbai,751286.0,yr
4340,MyWays,Machine Learning Developer,1.0,Mumbai,410952.0,yr
4341,Market Pulse Technologies,Software Engineer - Machine Learning,1.0,Mumbai,1612324.0,yr
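
The fixed-position slicing used in this step assumes that every value starts with exactly one currency character and ends with a three-character suffix. As an alternative sketch (not the approach used in this notebook), a regular expression with `Series.str.extract` can pull out the amount and the pay period by pattern instead of by position. The series `raw` and the group names `amount` and `period` are made up for this illustration.

```python
import pandas as pd

# A few salary strings in the dataset's format (illustrative values).
raw = pd.Series(["₹6,48,573/yr", "₹25,122/mo", "₹1,587/hr"])

# Capture the digits-and-commas amount and the pay period that follows the slash.
parts = raw.str.extract(r"(?P<amount>[\d,]+)/(?P<period>\w+)$")

# Remove the digit-group commas and convert the amount to a number.
parts["amount"] = parts["amount"].str.replace(",", "", regex=False).astype(float)

print(parts)
```

Rows that do not match the pattern come back as missing values, which makes malformed entries easy to spot.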


In [23]:
data_df

Unnamed: 0,Company_Name,Job_Title,Salaries_Reported,Location,Salary,wise
0,Mu Sigma,Data Scientist,105.0,Bangalore,648573.0,yr
1,IBM,Data Scientist,95.0,Bangalore,1191950.0,yr
2,Tata Consultancy Services,Data Scientist,66.0,Bangalore,836874.0,yr
3,Impact Analytics,Data Scientist,40.0,Bangalore,669578.0,yr
4,Accenture,Data Scientist,32.0,Bangalore,944110.0,yr
...,...,...,...,...,...,...
4338,TaiyōAI,Machine Learning Scientist,1.0,Mumbai,62160.0,mo
4339,Decimal Point Analytics,Machine Learning Developer,1.0,Mumbai,751286.0,yr
4340,MyWays,Machine Learning Developer,1.0,Mumbai,410952.0,yr
4341,Market Pulse Technologies,Software Engineer - Machine Learning,1.0,Mumbai,1612324.0,yr


In [29]:

# Convert monthly and hourly figures to yearly ones.
# Assigning through data_df.loc[mask, 'Salary'] updates the DataFrame itself and avoids the
# SettingWithCopyWarning that chained indexing (data_df.Salary.loc[mask] = ...) triggers.
data_df.loc[data_df.wise == 'mo', 'Salary'] *= 12
# Hourly to yearly assumes 8 hours a day, 5 days a week, 4 weeks a month and 12 months a year
data_df.loc[data_df.wise == 'hr', 'Salary'] *= 8 * 5 * 4 * 12


In [30]:
data_df.Salary.loc[data_df.wise=='hr']

190     3047040.0
196     1850880.0
263     1603200.0
269      351360.0
478     2880000.0
581      192000.0
814     2893440.0
912     1920000.0
1036    1155840.0
1066    4110720.0
1891    1971840.0
1892     192000.0
2291     297600.0
2429    2424960.0
2698     341760.0
2930     289920.0
3617     132480.0
3670      49920.0
3713     384000.0
3834     216960.0
3875     119040.0
4061    5792640.0
Name: Salary, dtype: float64
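
Another way to put all salaries on a yearly basis is to map each pay period to a multiplier and write the result to a new column. This is only a sketch on a toy frame: `toy_df`, `multipliers`, `HOURS_PER_YEAR` and the `Yearly_Salary` column are names invented here, and the hourly factor reuses the 8 × 5 × 4 × 12 assumption from the cell above.

```python
import pandas as pd

# Toy frame with one salary per pay period (illustrative values).
toy_df = pd.DataFrame({
    "Salary": [648573.0, 5180.0, 100.0],
    "wise": ["yr", "mo", "hr"],
})

# Hours per year under the assumption of 8 h/day, 5 days/week, 4 weeks/month, 12 months.
HOURS_PER_YEAR = 8 * 5 * 4 * 12
multipliers = {"yr": 1, "mo": 12, "hr": HOURS_PER_YEAR}

# Look up the multiplier for each row and scale the salary accordingly.
toy_df["Yearly_Salary"] = toy_df["Salary"] * toy_df["wise"].map(multipliers)

print(toy_df)
```

Writing to a separate column leaves the original Salary values untouched, so running the cell twice does not multiply the figures twice.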

In [6]:
# Now extract some more random rows from the dataset.
data_df.sample(10)

Unnamed: 0,Company_Name,Job_Title,Salaries_Reported,Location,Salary,wise
1801,Loylty Rewardz Mngt,Data Scientist,3.0,Mumbai,1223523.0,yr
2884,DONE by NONE,Data Analyst,1.0,Mumbai,201795.0,yr
2600,Optum Global Solutions,Data Analyst,6.0,New Delhi,101501.0,mo
140,Caterpillar,Data Scientist,3.0,Bangalore,1526335.0,yr
2734,Acuity Knowledge Partners,Data Analyst,2.0,New Delhi,508899.0,yr
1333,Kohler,Senior Data Scientist,1.0,Pune,1216773.0,yr
2657,Wizikey,Data Analyst,3.0,New Delhi,407238.0,yr
1495,Bharat Heavy Electricals,Data Scientist,1.0,Hyderabad,1668775.0,yr
2833,InSync Analytics,Data Analyst,2.0,Mumbai,17336.0,mo
2633,WheelsEye Technology,Data Analyst,4.0,New Delhi,625157.0,yr


The Salary column looks better now: it contains only numeric (float64) values.

In [7]:
# Let's get a quick summary of the dataset using info()
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4343 entries, 0 to 4342
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Company_Name       4340 non-null   object 
 1   Job_Title          4343 non-null   object 
 2   Salaries_Reported  4341 non-null   float64
 3   Location           4343 non-null   object 
 4   Salary             4343 non-null   float64
 5   wise               4343 non-null   object 
dtypes: float64(2), object(4)
memory usage: 203.7+ KB


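The info() function prints a concise summary of the DataFrame: the index range, each column's non-null count and dtype, and the memory usage. Here it shows that Company_Name and Salaries_Reported have a few missing values (4340 and 4341 non-null entries out of 4343).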
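
The non-null counts reported by info() can also be turned around into a count of missing values per column with `isna().sum()`. A minimal sketch on a made-up toy frame (`toy_df` and `missing` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing company name and one missing report count.
toy_df = pd.DataFrame({
    "Company_Name": ["Mu Sigma", None, "IBM"],
    "Salaries_Reported": [105.0, 95.0, np.nan],
    "Salary": [648573.0, 1191950.0, 836874.0],
})

# isna() marks missing cells as True; summing counts them per column.
missing = toy_df.isna().sum()

print(missing)
```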

In [8]:
# Let's generate descriptive statistics of the data using describe()
data_df.describe()

Unnamed: 0,Salaries_Reported,Salary
count,4341.0,4343.0
mean,2.776319,768129.6
std,5.14705,769308.5
min,1.0,26.0
25%,1.0,107966.0
50%,1.0,619354.0
75%,3.0,1098632.0
max,105.0,9568943.0


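describe() generates descriptive statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns of a DataFrame or for a Series, which gives a quick overview of how the values are distributed. Here, for example, the minimum Salary of 26.0 stands out as a value worth checking.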
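
To see how these statistics behave, the sketch below calls describe() on a tiny Series that contains one suspiciously small value and one missing value. The Series `salaries` is made up for illustration. Missing values are left out of the count, and the 50% row is the median.

```python
import numpy as np
import pandas as pd

# Four salaries plus one missing value (illustrative data).
salaries = pd.Series([26.0, 410952.0, 648573.0, 751286.0, np.nan], name="Salary")

summary = salaries.describe()

# count ignores the NaN; min exposes the outlier; 50% is the median of the four values.
print(summary)
```

Comparing the mean with the 50% row is a quick way to notice skew: a few extreme values pull the mean away from the median.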

In [9]:
data_df.Company_Name.value_counts()

Tata Consultancy Services     41
Amazon                        32
Accenture                     30
Google                        27
IBM                           26
                              ..
URS Technologies Solutions     1
Aniket Sonawane                1
Brahman bhetun                 1
Airavaat Car Rentals           1
Market Pulse Technologies      1
Name: Company_Name, Length: 2528, dtype: int64

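In [10]:
# Re-running the cleaning step fails here: Salary was already converted to float above,
# and the .str accessor only works on columns that hold strings.
data_df['Freq'] = data_df['Salary'].str.slice(-2)
data_df.Salary = data_df.Salary.str.slice(1,-3)
data_df.Salary = data_df.Salary.str.replace(',','')
data_df.Salary = data_df.Salary.astype(float)

data_df

AttributeError: Can only use .str accessor with string values!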
