EDA stands for Exploratory Data Analysis. It is the step in the data analysis process where analysts or data scientists examine the data to understand its characteristics, uncover patterns, detect anomalies, and gain insights. EDA uses statistical and visualization techniques to summarize the data, identify relationships between variables, and surface data quality issues such as missing values. The findings then inform decisions about data preprocessing, feature engineering, and modeling strategies.

This notebook breaks EDA into three steps:
1. Understand the data
2. Clean the data
3. Analyze the relationships between variables
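
The three steps above can be sketched end to end on a small frame. This is only an illustration: the `toy` DataFrame, its column names, and its values are hypothetical and are not taken from `data1.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy = pd.DataFrame({
    "units": [10, 20, np.nan, 40, 50],
    "cost": [100.0, 210.0, 290.0, 405.0, 500.0],
    "kind": ["A", "B", "A", None, "B"],
})

# 1. Understand the data: size, column types, summary statistics
print(toy.shape)
print(toy.dtypes)
print(toy.describe())

# 2. Clean the data: median for the numeric gap, most frequent value for the categorical gap
cleaned = toy.copy()
cleaned["units"] = cleaned["units"].fillna(cleaned["units"].median())
cleaned["kind"] = cleaned["kind"].fillna(cleaned["kind"].mode()[0])

# 3. Analyze relationships: pairwise correlation of the numeric columns
corr = cleaned[["units", "cost"]].corr()
print(corr)
```

The correlation matrix in step 3 is one common starting point; which relationship measures are appropriate depends on the column types.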

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
data = pd.read_csv('data1.csv')  # read the data

In [4]:
# 1. Understanding the data

In [5]:
# In a notebook cell only the last expression is displayed, so the output below is from describe()
data.head()      # first 5 rows of the data
data.tail()      # last 5 rows of the data
data.describe()  # summary statistics of the numeric columns

Unnamed: 0,APN,SITE COUNCIL DISTRICT,SITE #,SITE UNITS,PROJECT TOTAL UNITS,SH UNITS PER SITE,LAHD FUNDED,LEVERAGE,TAX EXEMPT CONDUIT BOND,TDC,JOBS,SITE LONGITUDE,SITE LATITUDE
count,595.0,595.0,595.0,595.0,595.0,595.0,595.0,595.0,595.0,595.0,415.0,595.0,595.0
mean,4931625000.0,8.42521,2.072269,54.868908,90.732773,20.235294,4466108.0,16059310.0,2952402.0,23477820.0,193.453012,-118.308551,34.054275
std,1153791000.0,4.631437,2.760317,49.132984,65.384237,32.16621,5053683.0,13217640.0,9713841.0,17725330.0,146.352206,0.077547,0.092567
min,2103009000.0,1.0,1.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,5.0,-118.60639,33.73614
25%,5039024000.0,6.0,1.0,21.0,49.0,0.0,0.0,7103994.0,0.0,12264280.0,84.5,-118.320325,34.01115
50%,5143020000.0,9.0,1.0,49.0,71.0,0.0,3225000.0,13100000.0,0.0,18869090.0,174.0,-118.28575,34.04781
75%,5511022000.0,13.0,1.0,75.5,102.0,37.5,6615000.0,22339640.0,3950000.0,29755000.0,261.5,-118.26329,34.086435
max,7455014000.0,15.0,20.0,438.0,438.0,262.0,39688210.0,94077720.0,184260400.0,223018100.0,1165.0,-118.16514,34.32402
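
By default `describe()` summarizes only numeric columns, which is why the table above shows 13 columns rather than all of them. A minimal sketch of how the text columns could be summarized as well, using a hypothetical `toy` frame whose column names are borrowed from the dataset but whose values are made up:

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the real dataset
toy = pd.DataFrame({
    "HOUSING TYPE": ["FAMILY", "SENIORS", "FAMILY", None],
    "SITE UNITS": [49, 55, 0, 196],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # adds count/unique/top/freq for text columns

print(numeric_summary)
print(full_summary)
```

`data.info()` is a complementary check that lists each column's dtype and non-null count in one view.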


In [6]:
data.shape  # number of rows and columns

(595, 31)

In [7]:
data.columns  # column names

Index(['APN', 'PROJECT NUMBER', 'NAME', 'DEVELOPMENT STAGE',
       'CONSTRUCTION TYPE', 'SITE ADDRESS', 'SITE  COUNCIL DISTRICT', 'SITE #',
       'SITE COMMUNITY', 'SITE UNITS', 'PROJECT TOTAL UNITS', 'HOUSING TYPE',
       'SUPPORTIVE HOUSING', 'SH UNITS PER SITE', 'DATE FUNDED', 'LAHD FUNDED',
       'LEVERAGE', 'TAX EXEMPT CONDUIT BOND', 'TDC', 'IN-SERVICE DATE',
       'DEVELOPER', 'MANAGEMENT COMPANY', 'CONTACT PHONE', 'PHOTO', 'JOBS',
       'PROJECT SUMMARY URL', 'CONTRACT NUMBERS', 'DATE STAMP',
       'SITE LONGITUDE', 'SITE LATITUDE', 'GPS_COORDS ON MAP'],
      dtype='object')
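
Note that 'SITE  COUNCIL DISTRICT' contains two consecutive spaces, which is easy to mistype when selecting the column. One optional way to normalize such names is sketched below on a hypothetical empty frame; if applied to `data`, any later code would need to use the normalized names.

```python
import pandas as pd

# Hypothetical frame with irregular whitespace in its column names
toy = pd.DataFrame(columns=["SITE  COUNCIL DISTRICT", " SITE UNITS ", "TDC"])

# Trim the ends and collapse runs of whitespace into a single space
toy.columns = toy.columns.str.strip().str.replace(r"\s+", " ", regex=True)

print(list(toy.columns))
```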

In [8]:
data.nunique()  # number of unique values in each column

APN                        584
PROJECT NUMBER             447
NAME                       595
DEVELOPMENT STAGE            2
CONSTRUCTION TYPE            7
SITE ADDRESS               591
SITE  COUNCIL DISTRICT      15
SITE #                      20
SITE COMMUNITY              87
SITE UNITS                 144
PROJECT TOTAL UNITS        132
HOUSING TYPE                 6
SUPPORTIVE HOUSING           2
SH UNITS PER SITE           91
DATE FUNDED                382
LAHD FUNDED                350
LEVERAGE                   396
TAX EXEMPT CONDUIT BOND    126
TDC                        440
IN-SERVICE DATE             26
DEVELOPER                  226
MANAGEMENT COMPANY         146
CONTACT PHONE              182
PHOTO                      312
JOBS                       242
PROJECT SUMMARY URL        595
CONTRACT NUMBERS           378
DATE STAMP                   1
SITE LONGITUDE             549
SITE LATITUDE              565
GPS_COORDS ON MAP          582
dtype: int64
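
The unique counts above already hint at columns that carry little analytical signal: DATE STAMP has a single value, while NAME and PROJECT SUMMARY URL have as many unique values as there are rows. A small sketch of how such columns could be flagged automatically, on a hypothetical `toy` frame with made-up values:

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the real dataset
toy = pd.DataFrame({
    "NAME": ["a", "b", "c", "d"],
    "DATE STAMP": ["2023-01-01"] * 4,
    "HOUSING TYPE": ["FAMILY", "SENIORS", "FAMILY", "FAMILY"],
})

counts = toy.nunique()
constant_cols = counts[counts == 1].index.tolist()        # same value in every row
id_like_cols = counts[counts == len(toy)].index.tolist()  # a different value in every row

print(constant_cols)
print(id_like_cols)
```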

In [9]:
data['DEVELOPMENT STAGE'].unique()  # unique values in 'DEVELOPMENT STAGE'

array(['In-Service', nan, 'Development'], dtype=object)

In [10]:
data['CONSTRUCTION TYPE'].unique()  # unique values in 'CONSTRUCTION TYPE'

array(['NEW CONSTRUCTION', 'REHAB', nan, 'ACQUISITION + REHAB',
       'ACQUISITION + NEW CONSTRUCTION', 'ACQUISITION ONLY',
       'BOTH REHAB AND NEW CONSTRUCTION', 'DEMO/NEW CONSTRUCTION'],
      dtype=object)

In [11]:
data['SUPPORTIVE HOUSING'].unique()  # unique values in 'SUPPORTIVE HOUSING'

array(['No', 'Yes'], dtype=object)
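
`unique()` lists the categories but not how often each occurs. `value_counts(dropna=False)` adds the frequencies and keeps missing values as their own entry. The sketch below uses a hypothetical series, not the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical values shaped like the 'DEVELOPMENT STAGE' column
stage = pd.Series(["In-Service", "Development", "In-Service", np.nan, "In-Service"])

# dropna=False keeps NaN as a separate category in the result
counts = stage.value_counts(dropna=False)
print(counts)
```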

In [12]:
# 2. Cleaning the data

In [13]:
data.isnull().sum()  # number of missing values in each column

APN                          0
PROJECT NUMBER               0
NAME                         0
DEVELOPMENT STAGE            1
CONSTRUCTION TYPE           34
SITE ADDRESS                 0
SITE  COUNCIL DISTRICT       0
SITE #                       0
SITE COMMUNITY               7
SITE UNITS                   0
PROJECT TOTAL UNITS          0
HOUSING TYPE                22
SUPPORTIVE HOUSING           0
SH UNITS PER SITE            0
DATE FUNDED                  1
LAHD FUNDED                  0
LEVERAGE                     0
TAX EXEMPT CONDUIT BOND      0
TDC                          0
IN-SERVICE DATE              0
DEVELOPER                   26
MANAGEMENT COMPANY          55
CONTACT PHONE               84
PHOTO                        0
JOBS                       180
PROJECT SUMMARY URL          0
CONTRACT NUMBERS           148
DATE STAMP                   0
SITE LONGITUDE               0
SITE LATITUDE                0
GPS_COORDS ON MAP            0
dtype: int64
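
Raw counts are easier to judge as a share of the rows. From the counts above, JOBS is missing in 180 of 595 rows (about 30.3%) and CONTRACT NUMBERS in 148 of 595 (about 24.9%). A minimal sketch of computing these shares directly, shown on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names mirror the real dataset
toy = pd.DataFrame({
    "JOBS": [5.0, np.nan, 174.0, np.nan],
    "TDC": [1.0, 2.0, 3.0, 4.0],
    "DEVELOPER": ["x", None, "y", "z"],
})

# isnull().mean() is the fraction of missing values per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
missing_pct = missing_pct[missing_pct > 0]  # keep only columns with gaps
print(missing_pct)
```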

In [27]:
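threshold = 0.3  # maximum allowed fraction of missing values per column
# thresh is the minimum number of non-missing values a column needs in order to be kept,
# so this drops columns that are more than about 30% missing
data_cleaned = data.dropna(thresh=int((1 - threshold) * len(data)), axis=1)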
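
Because `thresh` counts non-missing values rather than missing ones, the cutoff is easy to get backwards. The sketch below shows the behaviour on a hypothetical 10-row frame. For the 595-row dataset the same formula gives a cutoff of 416 non-missing values, and by the counts shown earlier JOBS has 595 - 180 = 415, so it sits right at the boundary and is worth checking after this cell runs.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row frame with different amounts of missing data
toy = pd.DataFrame({
    "mostly_full": [1, 2, 3, 4, 5, 6, 7, 8, np.nan, np.nan],       # 20% missing
    "sparse": [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan, np.nan],  # 40% missing
    "full": list(range(10)),                                       # 0% missing
})

threshold = 0.3
min_non_null = int((1 - threshold) * len(toy))  # non-missing values required to keep a column
kept = toy.dropna(thresh=min_non_null, axis=1)

print(min_non_null)
print(list(kept.columns))
```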

In [28]:
data_cleaned.columns.size  # number of columns remaining

31

In [29]:
data_cleaned = data_cleaned.dropna(thresh=15)  # keep rows with at least 15 non-NaN values
data_cleaned.shape  # number of rows and columns

(595, 31)

In [30]:
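# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which newer pandas versions warn about and which may not modify data_cleaned
data_cleaned['CONSTRUCTION TYPE'] = data_cleaned['CONSTRUCTION TYPE'].fillna(data_cleaned['CONSTRUCTION TYPE'].mode()[0])  # most frequent value
data_cleaned['JOBS'] = data_cleaned['JOBS'].fillna(data_cleaned['JOBS'].mean())  # column mean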


In [31]:
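# Fill the remaining categorical gaps with each column's most frequent value
data_cleaned['DEVELOPMENT STAGE'] = data_cleaned['DEVELOPMENT STAGE'].fillna(data_cleaned['DEVELOPMENT STAGE'].mode()[0])
data_cleaned['HOUSING TYPE'] = data_cleaned['HOUSING TYPE'].fillna(data_cleaned['HOUSING TYPE'].mode()[0])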


In [32]:
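# Free-text fields have no meaningful mode, so mark the gaps explicitly
data_cleaned['CONTACT PHONE'] = data_cleaned['CONTACT PHONE'].fillna('Unknown')
data_cleaned['MANAGEMENT COMPANY'] = data_cleaned['MANAGEMENT COMPANY'].fillna('Unknown')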
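
JOBS was filled with its mean above. The summary statistics show a mean of about 193 against a median of 174 and a maximum of 1165, which suggests a right-skewed column, and for skewed data the median is often a more robust fill value. The sketch below compares the two on a hypothetical series; it illustrates the trade-off and is not a change to the cleaning performed in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed values with two gaps (not the real JOBS column)
jobs = pd.Series([5.0, 84.0, 174.0, 261.0, 1165.0, np.nan, np.nan])

filled_mean = jobs.fillna(jobs.mean())      # pulled upward by the large value
filled_median = jobs.fillna(jobs.median())  # unaffected by the large value

print(filled_mean.iloc[-1], filled_median.iloc[-1])
```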


In [33]:
# Columns that are not used in the rest of the analysis
columns_to_drop = ['APN', 'PROJECT NUMBER', 'NAME', 'SITE ADDRESS', 'SITE #', 'DATE FUNDED',
                   'CONTACT PHONE', 'PHOTO', 'DATE STAMP', 'GPS_COORDS ON MAP']

data_cleaned = data_cleaned.drop(columns=columns_to_drop)


In [34]:
data_cleaned.shape  # number of rows and columns

(595, 21)

In [36]:
data_cleaned.columns  # remaining column names

Index(['DEVELOPMENT STAGE', 'CONSTRUCTION TYPE', 'SITE  COUNCIL DISTRICT',
       'SITE COMMUNITY', 'SITE UNITS', 'PROJECT TOTAL UNITS', 'HOUSING TYPE',
       'SUPPORTIVE HOUSING', 'SH UNITS PER SITE', 'LAHD FUNDED', 'LEVERAGE',
       'TAX EXEMPT CONDUIT BOND', 'TDC', 'IN-SERVICE DATE', 'DEVELOPER',
       'MANAGEMENT COMPANY', 'JOBS', 'PROJECT SUMMARY URL', 'CONTRACT NUMBERS',
       'SITE LONGITUDE', 'SITE LATITUDE'],
      dtype='object')

In [37]:
data_cleaned.isnull().sum()  # number of missing values in each remaining column

DEVELOPMENT STAGE            0
CONSTRUCTION TYPE            0
SITE  COUNCIL DISTRICT       0
SITE COMMUNITY               7
SITE UNITS                   0
PROJECT TOTAL UNITS          0
HOUSING TYPE                 0
SUPPORTIVE HOUSING           0
SH UNITS PER SITE            0
LAHD FUNDED                  0
LEVERAGE                     0
TAX EXEMPT CONDUIT BOND      0
TDC                          0
IN-SERVICE DATE              0
DEVELOPER                   26
MANAGEMENT COMPANY           0
JOBS                         0
PROJECT SUMMARY URL          0
CONTRACT NUMBERS           148
SITE LONGITUDE               0
SITE LATITUDE                0
dtype: int64

In [38]:
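# Fill the 7 remaining gaps in the cleaned frame (not the original `data`) with the most frequent value
data_cleaned['SITE COMMUNITY'] = data_cleaned['SITE COMMUNITY'].fillna(data_cleaned['SITE COMMUNITY'].mode()[0])
data_cleaned.head()  # first 5 rows of the cleaned data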


Unnamed: 0,DEVELOPMENT STAGE,CONSTRUCTION TYPE,SITE COUNCIL DISTRICT,SITE COMMUNITY,SITE UNITS,PROJECT TOTAL UNITS,HOUSING TYPE,SUPPORTIVE HOUSING,SH UNITS PER SITE,LAHD FUNDED,...,TAX EXEMPT CONDUIT BOND,TDC,IN-SERVICE DATE,DEVELOPER,MANAGEMENT COMPANY,JOBS,PROJECT SUMMARY URL,CONTRACT NUMBERS,SITE LONGITUDE,SITE LATITUDE
0,In-Service,NEW CONSTRUCTION,1,WESTLAKE,196,196,SENIORS,No,0,0.0,...,0,0.0,2003,,GSL PROPERRTY MANAGEMENT,193.453012,click here (http://hcidapp.lacity.org/ahtfRepo...,,-118.26584,34.05235
1,In-Service,REHAB,10,CRENSHAW DISTRICT,0,257,FAMILY,No,0,0.0,...,10208936,17312930.0,2006,"HAMPSTEAD PARTNERS, INC.","ALPHA PROPERTY MANAGEMENT, INC.",193.453012,click here (http://hcidapp.lacity.org/ahtfRepo...,,-118.34182,34.03071
2,In-Service,NEW CONSTRUCTION,9,CENTRAL,0,74,SPECIAL NEEDS,Yes,0,9389115.63,...,0,45471107.63,2021,Hollywood Community Housing Corporation,BARKER MANAGEMENT INCORPORATED,226.0,click here (http://hcidapp.lacity.org/ahtfRepo...,C-129358,-118.2574,34.01115
3,In-Service,NEW CONSTRUCTION,8,HYDE PARK,55,55,SENIORS,No,0,5281147.0,...,0,13709884.0,2009,Abode Communities previously known as LA COMMU...,ABODE COMMUNITIES,110.0,click here (http://hcidapp.lacity.org/ahtfRepo...,C-111486,-118.33139,33.97355
4,In-Service,NEW CONSTRUCTION,13,TEMPLE-BEAUDRY,49,49,FAMILY,No,0,2846000.0,...,0,13711989.0,2008,"AMERICAN COMMUNITIES, LLC",THE JOHN STEWART COMPANY,95.0,click here (http://hcidapp.lacity.org/ahtfRepo...,C-109452,-118.26086,34.06173
