# Exploratory Data Analysis using Python
Exploratory Data Analysis (EDA) is an approach to analyzing data sets that summarizes their main characteristics, often using statistical graphics and other data visualization methods. EDA is useful for:

1) Getting a better understanding of data<br>
2) Identifying various data patterns<br>
3) Getting a better understanding of the problem statement

<a href="https://pandas.pydata.org/">Pandas</a> is a popular Python library for working with tabular data (similar to the data stored in a spreadsheet). Pandas provides helper functions to read data from various sources such as CSV files, Excel spreadsheets, HTML tables, SQL databases, and more.

I'm using a dataset that I found on <a href="https://www.kaggle.com/iamsouravbanerjee/analytics-industry-salaries-2022-india">Kaggle</a>. It comes in the following CSV format:

<code>Company Name,Job Title,Salaries Reported,Location,Salary
Mu Sigma,Data Scientist,105,Bangalore,"₹6,48,573/yr"
IBM,Data Scientist,95,Bangalore,"₹11,91,950/yr"
Tata Consultancy Services,Data Scientist,66,Bangalore,"₹8,36,874/yr"
Impact Analytics,Data Scientist,40,Bangalore,"₹6,69,578/yr"
Accenture,Data Scientist,32,Bangalore,"₹9,44,110/yr"
Infosys,Data Scientist,30,Bangalore,"₹9,08,764/yr"</code>
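
As a quick illustration of how pandas parses this format, here is a minimal sketch that loads two rows of the excerpt above from an in-memory string instead of a file. The names `csv_text` and `sample_df` are made up for this example and are not part of the original notebook.

```python
import io

import pandas as pd

# Two rows copied from the CSV excerpt above (hypothetical in-memory stand-in
# for the real file).
csv_text = '''Company Name,Job Title,Salaries Reported,Location,Salary
Mu Sigma,Data Scientist,105,Bangalore,"₹6,48,573/yr"
IBM,Data Scientist,95,Bangalore,"₹11,91,950/yr"
'''

# read_csv accepts a file path or any file-like object, such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

# The quoted Salary field keeps its commas and is read as text, not a number.
print(sample_df.shape)
print(sample_df.dtypes)
```

Because the Salary values are quoted, the commas inside them are not treated as column separators, which is why the column arrives as text and needs cleaning.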

Before analyzing this CSV file, we first need to clean the data: identify the incorrect or irrelevant parts and then replace or delete them. For instance, the ₹ symbol and the /yr suffix should be removed from the Salary column. To do that, let's import the pandas module and read the CSV file.

# Importing the pandas and NumPy modules with the aliases pd and nm

In [1]:
import pandas as pd
import numpy as nm

# Reading a CSV file using Pandas

In [6]:
# Read the CSV file into the variable data_df
data_df = pd.read_csv("Dataset.csv")

In [7]:
# Check the type of the variable
type(data_df)

pandas.core.frame.DataFrame

In [8]:
# Let's extract some random rows from the dataset.
data_df.sample(10)

Unnamed: 0,Company Name,Job_Title,Salaries_Reported,Location,Salary
3572,Super Highway Labs,Data Engineer,1.0,New Delhi,"₹36,65,141/yr"
3807,Nones,Machine Learning Engineer,1.0,Bangalore,"₹12,08,801/yr"
3967,Tata Elxsi,Machine Learning Engineer,1.0,Pune,"₹4,32,559/yr"
1053,Tata Consultancy Services,Data Scientist,1.0,Pune,"₹25,000/mo"
4157,BYJU'S,Machine Learning Engineer,1.0,New Delhi,"₹22,19,746/yr"
3747,Tata Consultancy Services Big,Data Engineer,5.0,Mumbai,"₹5,50,000/yr"
3639,IFI Techsolutions,Data Engineer,2.0,Mumbai,"₹3,53,152/yr"
4278,BridgeLabz Solutions,Machine Learning Engineer,1.0,Mumbai,"₹22,304/mo"
3584,Boston Scientific,Data Engineer,1.0,New Delhi,"₹27,07,106/yr"
4343,vPhrase,Machine Learning Engineer,1.0,Mumbai,"₹9,39,843/yr"
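
The `Unnamed: 0` column in the output above is what pandas produces when the first header field of a CSV is empty, which commonly happens when a DataFrame was saved together with its index. Whether that is how this particular file was produced is an assumption. Under that assumption, a minimal sketch of how `index_col=0` avoids the extra column (the CSV text here is a hypothetical stand-in):

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV that was saved together with its index:
# the header row starts with an empty field.
csv_text = ''',Company Name,Salary
0,Mu Sigma,"₹6,48,573/yr"
1,IBM,"₹11,91,950/yr"
'''

# Default behaviour: the unnamed first column is kept as a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(with_extra.columns.tolist())

# index_col=0 uses that first column as the row index instead.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(as_index.columns.tolist())
```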


The data looks good overall, but the Salary column needs cleaning: in addition to the number, its values contain the ₹ symbol, comma separators, and a /yr or /mo suffix.
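
Before stripping anything, it can help to check which pay-period suffixes actually occur. A minimal sketch, using a few salary strings copied from the sample output above (the `salaries` Series is a stand-in for the real Salary column, and the regular expression assumes each value ends in `/` followed by letters):

```python
import pandas as pd

# Salary strings copied from the sample output above.
salaries = pd.Series(["₹36,65,141/yr", "₹25,000/mo", "₹4,32,559/yr", "₹22,304/mo"])

# Capture the text after the final "/" (the pay period) and count each value.
periods = salaries.str.extract(r"/(\w+)$", expand=False)
period_counts = periods.value_counts()
print(period_counts)
```

On the full dataset, the same two lines applied to `data_df["Salary"]` would show how many rows are not yearly figures.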

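In [None]:
# str.replace returns a new Series, so assign the result back to the column.
data_df['Salary'] = data_df['Salary'].str.replace('₹', '', regex=False)
data_df['Salary'] = data_df['Salary'].str.replace(',', '', regex=False)
data_df['Salary'] = data_df['Salary'].str.replace('/yr', '', regex=False)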
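
Removing `/yr` alone leaves the `/mo` rows as text, and the column is still not numeric. One possible way to handle both, sketched on a small stand-in DataFrame: split each value into an amount and a pay period, then convert to a yearly number. The names `demo_df`, `periods_per_year`, and `Salary_Per_Year` are hypothetical, and multiplying monthly figures by 12 is an assumption about how they should be compared with yearly ones.

```python
import pandas as pd

# Salary strings copied from the sample output above (stand-in for data_df).
demo_df = pd.DataFrame({"Salary": ["₹36,65,141/yr", "₹25,000/mo", "₹9,39,843/yr"]})

# Split each value into the numeric part and the pay period.
parts = demo_df["Salary"].str.extract(r"^₹(?P<amount>[\d,]+)/(?P<period>\w+)$")

# Remove the comma separators and convert the amounts to numbers.
amount = pd.to_numeric(parts["amount"].str.replace(",", "", regex=False))

# Assumption: a monthly figure is annualized by multiplying by 12.
# Periods missing from this mapping become NaN rather than being guessed.
periods_per_year = {"yr": 1, "mo": 12}
demo_df["Salary_Per_Year"] = amount * parts["period"].map(periods_per_year)

print(demo_df)
```

Keeping the original Salary column alongside the new one makes it easy to spot-check the conversion against the raw strings.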
