# Inter-American Development Bank
* The Inter-American Development Bank (IADB or IDB or BID) is the largest source of development financing for Latin America and the Caribbean. Established in 1959, the IDB supports Latin American and Caribbean economic development, social development and regional integration by lending to governments and government agencies, including State corporations.

## IDB - Social Protection

* Poverty is a multidimensional phenomenon characterized by the presence of unsatisfied basic needs that are the result of a large number of factors. At the same time, poverty is the cause and consequence of social exclusion, which is understood as a situation that impedes people from attaining a minimum level of well-being, developing their potential, and participating on equal terms in the social, political and economic aspects of life.  Age, ethnicity, gender, a condition of dependency, and exposure to family violence are some of the factors associated with social exclusion.

* The goal of the IDB’s Division of Social Protection and Health is to promote the social inclusion of people who are living in poverty and vulnerability, as well as to support minimum consumption levels among the population living in extreme poverty through programs that promote capacity-building among four population groups.

### IDB COUNTRY STRATEGY WITH COSTA RICA
The Country Strategy has the objective of contributing to the government’s actions to achieve higher, more inclusive and sustainable growth and speed up the pace of poverty reduction. The Bank will support the country in attaining the objectives set out in its National Development Plan (PND) 2015-2018 and in the IDB’s Institutional Strategy. To that end, the Bank’s activities will focus on four strategic objectives:
*   supporting fiscal sustainability and efficient spending;
*  improving productive infrastructure quality, efficiency, and sustainability;
*  boosting the competitiveness of small and medium-sized enterprises; and
*  strengthening the human capital accumulation strategy.


In [None]:
#Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
from PIL import  Image
%matplotlib inline
import pandas as pd
import seaborn as sns
import itertools
import warnings
warnings.filterwarnings("ignore")
from wordcloud import WordCloud,STOPWORDS
import io
import base64
from matplotlib import rc,animation
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.basemap import Basemap
import folium
import folium.plugins
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import os
print(os.listdir("../input"))

In [None]:
img = np.array(Image.open(r"../input/image-idb/logo_ingles_blanco_sin descriptor.png"))
fig = plt.figure(figsize=(13,13))
plt.imshow(img,interpolation="bilinear")
plt.axis("off")
fig.set_facecolor("navy")
plt.show()

## Index
- <a href='#1'>1. Data-overview</a>
    - <a href="#1-1">1.1 Data dimensions</a>
    - <a href="#1-2">1.2 Percentage of Missing values</a>
- <a href="#2">2. Dependent Variable</a>
- <a href="#3" > 3. Correlation </a>
    - <a href="#3-1">3.1 Positive correlated variables </a>
    - <a href="#3-2"> 3.2.  Negatively correlated variables </a>
- <a href="#4" > 4 . Monthly rent payment</a>
    - <a href="#4-1" > 4.1. Distribution of monthly rent payment by poverty levels</a>
    - <a href="#4-2" > 4.2. Average Monthly rent by persons and rooms in a house hold </a>
- <a href="#5" >5 . Gender Distribution by Poverty levels</a>
- <a href="#6" >6 . Households and Persons Per Household</a>
    - <a href="#6-1" >6.1. Percentage of poverty level by persons in house hold</a>
    - <a href="#6-2" > 6.2. Total persons by house hold below and above 12 years</a>
    - <a href="#6-3" >6.3. Percentage of Gender by persons in house hold</a>
    - <a href="#6-4" >6.4. Gender by house hold below and above 12 years</a>
    - <a href="#6-5" > 6.5.  Persons living in house holds with respect to house hold size</a>
    - <a href="#6-6" > 6.6.  Persons living in household by categories</a>
    - <a href="#6-7" > 6.7. House Ownership types</a>
- <a href="#7" >7 . Over Crowding </a>
    - <a href="#7-1" > 7.1. Percentage of poverty level by rooms in house hold</a>
    - <a href="#7-2" >7.2.  Over Crowding by Poverty Categories</a>
    - <a href="#7-3" >7.3.  Over crowding by poverty levels</a>
    - <a href="#7-4" >7.4. Summary stats on over crowding squared  by poverty levels</a>
- <a href="#8" >8 . AGE </a>
    - <a href="#8-1" >8.1. Age Distribution by Poverty levels </a>
    - <a href="#8-2" > 8.2.  Age distribution by gender</a>
- <a href="#9" >9 . EDUCATION </a>
    - <a href="#9-1" >9.1. Years of schooling by poverty levels </a>
    - <a href="#9-2" > 9.2. Summary stats on years of schooling by poverty levels</a>
    - <a href="#9-3" >9.3. Years behind schooling by poverty levels </a>
    - <a href="#9-4" >  9.4. Years of education by gender. </a>
    - <a href="#9-5" >9.5. average years of education for adults (18+) by poverty levels</a>
    - <a href="#9-6" > 9.6. Education Level Frequency distribution </a>
    - <a href="#9-7" > 9.7. Education levels by gender </a>
    - <a href="#9-8" >9.8. Education levels by poverty category</a>
- <a href="#10" >10 . Walls </a>
    - <a href="#10-1" >10.1. Predominant material on the outside wall </a>
    - <a href="#10-2" > 10.2. Wall condition by predominant material on the outside wall</a>
    - <a href="#10-3" >10.3. Poverty  by wall condition  </a>
- <a href="#11" >11 . Flooring</a>
    - <a href="#11-1" >11.1. Predominant material on the floor </a>
    - <a href="#11-2" > 11.2. Floor condition by predominant material on the floor</a>
    - <a href="#11-3" >11.3. Poverty level  by floor  condition  </a>
- <a href="#12" >12.CEILING </a>
    - <a href="#12-1" >12.1. Predominant material on the ceiling </a>
    - <a href="#12-2" > 12.2. Ceiling condition by predominant material on the roof</a>
    - <a href="#12-3" >12.3.poverty level by ceiling condition   </a>
- <a href="#13" >13.Electricity & water provision </a>
    - <a href="#13-1" >13.1.Electricity Source in the dwelling </a>
    - <a href="#13-2" >13.2.Water Source type in the dwelling</a>
    - <a href="#13-3" >13.3.Energy Source for cooking in the dwelling   </a>
    - <a href="#13-4" >13.4.Percentage of poverty level by energy source   </a>
- <a href="#14" >14.Sanitation </a>
    - <a href="#14-1" >14.1.Sanitation system type in the dwelling </a>
    - <a href="#14-2" >14.2.Elimination type in the dwelling </a>
- <a href="#15" >15.Civil Status and Relations </a>
    - <a href="#15-1" >15.1.Civil status category by the householders </a>
    - <a href="#15-2" >15.2.Relation status category by the householders </a>
- <a href="#16" >16.Dependency(squared) </a>
    - <a href="#16-1" >16.1.Average squared dependency for poverty levels </a>
    - <a href="#16-2" >16.2.Average squared dependency by education levels </a>
- <a href="#17" >17.Zones </a>
    - <a href="#17-1" >17.1. zone percentage in training data </a>
    - <a href="#17-2" >17.2.  urban and rural percentage by poverty levels </a>
- <a href="#18" > 18 . Socio-economic regions of Costa Rica </a>
    - <a href="#18-1" >18.1. Poverty level percentage by Region </a>
    - <a href="#18-2" > 18.2. Average monthly rent payment by regions</a>
    - <a href="#18-3" > 18.3. overcrowding  by regions  </a>
    - <a href="#18-4" > 18.4.  Average Squared dependency  by regions</a>
    - <a href="#18-5" >18.5. Rural and urban percentage by regions  </a>


# <a id='1'>1 . Data - overview </a>

In [None]:
#Importing Data
train = pd.read_csv(r"../input/costa-rican-household-poverty-prediction/train.csv")
test  = pd.read_csv(r"../input/costa-rican-household-poverty-prediction/test.csv")
display ("training - data")
display (train.head(3))
display ("testing - data")
display (test.head(3))

## <a id='1-1'>1.1  Data Dimensions  </a>

In [None]:
print ("TRAIN DATA - ","Rows : ", train.shape[0]," , Columns : ",train.shape[1])
print ("TEST  DATA - ","Rows : ", test.shape[0],", Columns : ",test.shape[1])

## <a id='1-2'>1.2  Percentage of Missing values </a>
Missing data variables
1. v2a1	  -  Monthly rent payment.
2. v18q1	 -  number of tablets household owns.
3. rez_esc	-  Years behind in school.
4. meaneduc	-  average years of education for adults (18+).
5. education of adults (>=18) in the household
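
The share of missing values per column can also be read as a plain table rather than a heatmap. The sketch below uses a small made-up frame (`toy`, not the competition data) and only standard pandas calls; the column names are borrowed from the list above purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "v2a1":    [100000.0, np.nan, np.nan, 80000.0],
    "rez_esc": [np.nan, np.nan, np.nan, 1.0],
    "rooms":   [3, 4, 5, 2],
})

# isnull().mean() gives the fraction of missing rows per column; scale to percent.
missing_pct = toy.isnull().mean() * 100

# Keep only the columns that actually have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Swapping `toy` for `train` or `test` would give the same percentages that the heatmaps in the next cell annotate.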

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(121)
sns.heatmap((train.isnull().sum()[(train.isnull().sum() > 0)]*100 / train.shape[0]).to_frame()
            ,annot=True,cmap = "cool",linewidth = 1,linecolor = "k",fmt = "f")
plt.title("%ge of missing values in train data")

plt.subplot(122)
sns.heatmap((test.isnull().sum()[(test.isnull().sum() > 0)]*100 / test.shape[0]).to_frame(),
           annot=True,linewidth = 1,linecolor = "k",fmt = "f",cmap = "cool")
plt.title("%ge of missing values in test data")
plt.subplots_adjust(wspace = .7)

## <a id='2'>2 .Dependent Variable </a>
1 . Target - the target is an ordinal variable indicating groups of income levels. 
* 1 =  extreme poverty .
* 2 = moderate poverty .
* 3 = vulnerable households .
* 4 = non vulnerable households.                                                                                                                                                                         

2 . 8% of the rows in the training data are people living in extreme poverty, and 17% are in moderate poverty.
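
Class shares like these can be computed directly from the target column. This is a minimal sketch on hypothetical codes (`toy_target` is invented, so its percentages are not the real ones); the label mapping follows the list above.

```python
import pandas as pd

# Hypothetical target codes; the real shares come from train["Target"].
toy_target = pd.Series([4, 4, 4, 4, 4, 4, 2, 2, 3, 1])

labels = {1: "extreme poverty", 2: "moderate poverty",
          3: "vulnerable", 4: "non vulnerable"}

# normalize=True returns fractions; sort_index keeps the ordinal order 1..4.
share = (toy_target.value_counts(normalize=True)
                   .sort_index()
                   .rename(index=labels) * 100)
print(share.round(1))
```

Sorting by index before labelling keeps each label tied to its code, instead of relying on the frequency order that `value_counts()` returns.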

In [None]:
plt.figure(figsize=(13,6))
plt.subplot(121)
train["Target"].value_counts().plot.pie(autopct = "%1.0f%%",wedgeprops = {"linewidth":1,"edgecolor":"white"},
                                       colors = sns.color_palette("husl"),
                                       labels = ["Non vulnerable","Moderate","Vulnerable","Extreme"])
plt.ylabel("")
plt.title("Proportion of Target variable")


plt.subplot(122)
ax = sns.countplot(y  = train["Target"] , order = train["Target"].value_counts().index,
                   linewidth = 1,edgecolor = "k", palette = "Set1")

for i,j in enumerate(train["Target"].value_counts().values):
    ax.text(.7,i,j,fontsize = 12)
plt.grid(True,alpha=  .1)
plt.title("Count for target variable")

plt.show()

# <a id='3'>3 . CORRELATION  </a>
## <a id='3-1'>3.1. Positive correlated variables </a>
* Pearson correlation coefficient greater than 0.8

In [None]:
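# Correlate numeric columns only; non-numeric columns are excluded explicitly
correlation = train.select_dtypes(include="number").corr()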

links = correlation.stack().reset_index()
links.columns = ["var1","var2","value"]

filtered_links_pos = links.loc[(links["value"] >  0.8) & (links["var1"] != links["var2"])]
filtered_links_neg = links.loc[(links["value"] <  -0.8) & (links["var1"] != links["var2"])]

pos_corr = train[filtered_links_pos["var1"].unique()].corr()
neg_corr = train[filtered_links_neg["var1"].unique()].corr()

mask = np.zeros_like(pos_corr,dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(13,10))
sns.heatmap(pos_corr,mask=mask,center=0,square=True,linewidth = 1)
plt.title("Positively correlated variables")
plt.show()


## <a id='3-2'>3.2.  Negatively correlated variables </a>
* Pearson correlation coefficient less than -0.8

In [None]:
mask = np.zeros_like(neg_corr,dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(12,9))
sns.heatmap(neg_corr,mask=mask,center=0,square=True,linewidth = 1)
plt.title("Negatively correlated variables")
plt.show()
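
The heatmaps show which variables are strongly related, but not the individual pairs. A possible way to list them, sketched here on toy data (`toy`, `pairs`, `strong_pos` and `strong_neg` are illustrative names, not part of the notebook), is to keep the strict upper triangle of the correlation matrix so that each pair appears once and self-correlations are excluded.

```python
import numpy as np
import pandas as pd

# Toy data: "b" is an exact multiple of "a", "c" is its mirror image, "d" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],
    "d": [1.0, -1.0, 1.0, -1.0, 1.0],
})

corr = toy.corr()

# Strict upper triangle: each pair shows up once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().reset_index()
pairs.columns = ["var1", "var2", "value"]

threshold = 0.8  # same cut-off as sections 3.1 and 3.2
strong_pos = pairs[pairs["value"] > threshold]
strong_neg = pairs[pairs["value"] < -threshold]
print(strong_pos)
print(strong_neg)
```

On the toy frame only `a`/`b` passes the positive cut-off, while `a`/`c` and `b`/`c` pass the negative one.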



# <a id='4'>4 . Monthly rent payment </a>

## <a id='4-1'>4.1. Distribution of monthly rent payment by poverty levels. </a>
* 71% of the rows have missing values in this column.
* Distribution of monthly rent payment by poverty levels, using only non-missing values.

In [None]:
df = train.copy()
df["Target"] = df["Target"].map({1 : "extreme poverty",
                                 2 : "moderate poverty",
                                 3 : "vulnerable",
                                 4 : "non vulnerable"}) 

pov_lst  = ["extreme poverty" , "moderate poverty" , "vulnerable" , "non vulnerable"]
length   = len(pov_lst)
cs       = ["r","lime","b","m"]

plt.figure(figsize=(13,12))
for i,j,k in itertools.zip_longest(pov_lst,range(length),cs):
    plt.subplot(2,2,j+1)
    ax = sns.kdeplot(df[(df["v2a1"].notnull()) & (df["Target"] == i)]["v2a1"] ,
                    label = i ,shade =True,color = k)
    plt.axvline(df[(df["v2a1"].notnull()) & (df["Target"] == i)]["v2a1"].mean(),
                color = "w" ,linestyle = "dashed",label = "mean")
    plt.axvline(df[(df["v2a1"].notnull()) & (df["Target"] == i)]["v2a1"].std(),
                color = "grey" ,linestyle = "dashed",label = "Std")
    plt.legend(loc = "best" , prop = {"size" : 10})
    plt.title("Monthly rent payment - " + i)
    ax.set_facecolor("k")
    plt.grid(True,alpha = .1)
    plt.xlabel("Monthly rent")

## <a id='4-2'>4.2. Average Monthly rent by persons and rooms in a house hold </a>

In [None]:
pvt = pd.pivot_table(data=df,aggfunc="mean",columns="rooms",index="r4t3",values="v2a1")

plt.figure(figsize=(13,10))
sns.heatmap(pvt,annot=True,linecolor = "k",linewidth = 1)
plt.xlabel("TOTAL ROOMS IN HOUSE HOLD",color = "b")
plt.ylabel("TOTAL PERSONS LIVING IN HOUSE HOLD",color = "b")
plt.title("Average Monthly rent by persons and rooms in a house hold",color = "b")
plt.show()

## <a id='5'>5 . Gender Distribution by Poverty levels </a>
* In the extreme and moderate poverty levels, the female percentage is higher than the male percentage.

In [None]:
pov_lst
length = len(pov_lst)

plt.figure(figsize=(12,12))
for i,j in itertools.zip_longest(pov_lst,range(length)):
    plt.subplot(2,2,j+1)
    # sort_index() fixes the order to 0 (female), 1 (male) so the labels always match
    df[df["Target"]  ==  i]["male"].value_counts().sort_index().plot.pie(autopct = "%1.0f%%",wedgeprops = {"linewidth":2,"edgecolor":"white"},
                                                            colors = ["orangered","royalblue"] , 
                                                            labels = ["female","male"],
                                                            shadow = True)
    plt.title(i,color = "b")
    plt.ylabel("")
    

# <a id='6'>6 . Households and Persons Per Household </a>

## <a id='6-1'>6.1. Percentage of poverty level by persons in house hold </a>
* The percentage of poverty tends to increase as the number of persons in a household increases.

In [None]:
#r4t3	 Total persons in the household
df = df.rename(columns = {"r4t3" : "total_hh_persons"})

per = pd.crosstab(df["total_hh_persons"],df["Target"]).apply(lambda r:r/r.sum()*100,axis =1)
ax = per.plot(kind = "bar",stacked =True,figsize = (12,7),
              color = sns.color_palette("husl"),alpha = .9,
              linewidth = 1,edgecolor = "w")
ax.set_facecolor("k")
plt.xlabel("Total persons in the household")
plt.ylabel("percentage")
plt.legend(loc = "best",prop = {"size" : 13})
plt.xticks(rotation = 0)
plt.grid(True,alpha = .2)
plt.title("Percentage of poverty level by persons in house hold")
plt.show()
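
The row percentages behind this chart can also be produced without the `apply` lambda, since `pd.crosstab` accepts `normalize="index"`. The sketch below uses made-up household sizes and labels (`toy` is hypothetical) to show the equivalent call.

```python
import pandas as pd

# Hypothetical household sizes and poverty labels, made up for illustration.
toy = pd.DataFrame({
    "total_hh_persons": [2, 2, 2, 2, 5, 5, 5, 5],
    "Target": ["non vulnerable", "non vulnerable", "non vulnerable", "vulnerable",
               "extreme poverty", "moderate poverty", "non vulnerable", "non vulnerable"],
})

# normalize="index" divides each row by its own total, so every household size sums to 100.
per = pd.crosstab(toy["total_hh_persons"], toy["Target"], normalize="index") * 100
print(per.round(1))
```

Because each row is scaled separately, the bars compare the poverty mix within a household size, not the number of households of that size.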

## <a id='6-2'>6.2. Total persons by house hold below and above 12 years  </a>
* r4t1	 -  persons younger than 12 years of age.
* r4t2	- persons 12 years of age and older.

In [None]:
#df[["r4h1","r4h2","hh_males","r4m1","r4m2","hh_females","r4t1","r4t2","total_hh_persons"]]
df = df.rename(columns = {"r4h1" : "lt_12_males" , "r4h2": "gt_12_males" ,
                          "r4m1" : "lt_12_females" , "r4m2" : "gt_12_females",
                          "r4t1" : "lt_12_total_persons" , "r4t2" : "gt_12_total_persons"} )


plt.figure(figsize=(12,6))
ax = sns.kdeplot(df["lt_12_total_persons"],color = "b" , linewidth = 2 ,shade=True ,
                 label = "persons younger than 12 years")
ax = sns.kdeplot(df["gt_12_total_persons"],color = "r" , linewidth = 2 ,shade=True ,
                 label = "persons older than 12 years")
ax.set_facecolor("lightgrey")
plt.legend(loc = "best" , prop = {"size" : 12})
plt.title("Total persons by house hold below and above 12 years")
plt.ylabel("count")
plt.xlabel("total persons")
plt.xticks(np.arange(0, 13,1))
plt.grid(alpha = .3)
plt.show()

## <a id='6-3'>6.3. Percentage of Gender by persons in house hold </a>

In [None]:
df[["r4h3","r4m3","total_hh_persons"]]
df = df.rename(columns = {"r4h3":"hh_males","r4m3":"hh_females"} )

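# Name the key and count columns explicitly so the merge does not depend on the pandas version
m = df["hh_males"].value_counts().rename_axis("index").reset_index(name = "hh_males")
f = df["hh_females"].value_counts().rename_axis("index").reset_index(name = "hh_females")
x = m.merge(f,on="index",how= "left").sort_values(by = "index",ascending = True)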
x["male_percent"]    = x["hh_males"]*100/(x["hh_males"]+x["hh_females"])
x["female_percent"]  = x["hh_females"]*100/(x["hh_males"]+x["hh_females"])
x.index = x["index"]

ax = x[['male_percent',"female_percent"]].plot(kind = "bar",stacked =True,
                                               figsize = (12,6) , linewidth = 2,
                                              edgecolor = "w")

ax.set_facecolor("k")
plt.ylabel("percentage")
plt.xlabel("persons in house hold")
plt.legend(loc = "best",prop = {"size" : 13})
plt.xticks(rotation = 0)
plt.grid(True,alpha = .2)
plt.title("Percentage of Gender by persons in house hold")
plt.show()

##  <a id='6-4'>6.4. Gender by house hold below and above 12 years </a>

In [None]:

plt.figure(figsize=(13,5))
plt.subplot(121)
ax = sns.kdeplot(df["lt_12_males"],color = "b" , linewidth = 2 ,shade=True ,
                 label = "males younger than 12 years")
ax = sns.kdeplot(df["gt_12_males"],color = "r" , linewidth = 2 ,shade=True ,
                 label = "males older than 12 years")
plt.xticks(np.arange(0,9,1))
ax.set_facecolor("lightgrey")
plt.title("males by house hold below and above 12 years")
plt.xlabel("persons")
plt.grid(True,alpha = .3)

plt.subplot(122)
ax1 = sns.kdeplot(df["lt_12_females"],color = "b" , linewidth = 2 ,shade=True ,
                 label = "females younger than 12 years")
ax1 = sns.kdeplot(df["gt_12_females"],color = "r" , linewidth = 2 ,shade=True ,
                 label = "females older than 12 years")
plt.title("females by house hold below and above 12 years")
plt.xlabel("persons")
plt.grid(alpha = .3)
ax1.set_facecolor("lightgrey")

## <a id='6-5'>6.5.  Persons living in house holds with respect to house hold size  </a>
* tamhog - size of the household
* tamviv - number of persons living in the household

In [None]:
#tamhog	 size of the household
#tamviv	 number of persons living in the household

df = df.rename(columns = {"tamhog":"hh_size","tamviv":"persons_living_hh"})

df = df.drop(columns = ["hhsize"])

pov_lst
length 


plt.figure(figsize=(13,13))
for i,j in itertools.zip_longest(pov_lst,range(length)):
    plt.subplot(2,2,j+1)
    ax = sns.kdeplot(df[df["Target"] == i]["hh_size"],label = "house size",
                     shade=True,linewidth = 2,color = "#FF3310")
    ax = sns.kdeplot(df[df["Target"] == i]["persons_living_hh"],label = "persons living",
                     shade=True,linewidth = 2,color = "#FFFF00")
    plt.title(i,color = "b")
    ax.set_facecolor("k")
    plt.legend(prop = {'size' : 13})

## <a id='6-6'>6.6. Persons living in household by categories </a>

In [None]:
# hogar_nin	 Number of children 0 to 19 in household
# hogar_adul	 Number of adults in household
# hogar_mayor	 # of individuals 65+ in the household
# hogar_total	 # of total individuals in the household
#
home_cols = ['hogar_nin', 'hogar_adul', 'hogar_total', 'hogar_mayor']
lab = ["children 0 to 19 in household" , "adults in household" , 
       "total individuals in the household" ,"individuals 65+ in the household"]

df[home_cols]

plt.figure(figsize=(13,10))
for i,j,k in itertools.zip_longest(home_cols,range(len(home_cols)),lab):
    plt.subplot(2,2,j+1)
    ax = sns.kdeplot(df[i],shade=True,color = "w",label = k)
    ax.set_facecolor("k")
    plt.axvline(df[i].mean(),color = "b",
                linestyle = "dashed",label = "Avg")
    plt.legend(loc = "best")
    plt.grid(True,alpha = .1)
    plt.title(k)

## <a id='6-7'>6.7. House Ownership types </a>

In [None]:
# tipovivi1	 =1 own and fully paid house
# tipovivi2	 =1 own, paying in installments
# tipovivi3	 =1 rented
# tipovivi4	 =1 precarious
# tipovivi5	 =1 other (assigned, borrowed)


hh_kind_cols = df.columns[df.columns.str.contains("tipoviv")].tolist()
hh_kind_cols

def hh_kind_lab(df) :
    if df["tipovivi1"] == 1 :
        return "Own house(fully paid)"
    elif df["tipovivi2"] == 1 :
        return "Own house(installment pays)"
    elif df["tipovivi3"] == 1 :
        return "Rented"
    elif df["tipovivi4"] == 1 :
        return "precarious"
    elif df["tipovivi5"] == 1 :
        return "Borrowed"

df["house_kind"]  = df.apply(lambda df:hh_kind_lab(df),axis = 1)

plt.figure(figsize= (8,6))
ax = sns.countplot(y = df["house_kind"],order= df["house_kind"].value_counts().index,
                  linewidth = 1 ,edgecolor = "w",
                  palette = "gist_ncar")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)
plt.title("House Ownership types")
plt.show()
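
The row-wise `apply` above is easy to read but slow on large frames. A vectorized alternative, sketched here on toy flags, is to take the name of the column that holds the 1 with `idxmax(axis=1)` and map it to a label. The names `toy` and `kind_labels` are hypothetical, and the sketch assumes every row has exactly one flag set; a row with no flag set would be assigned the first column's label instead of a missing value.

```python
import pandas as pd

# Toy one-hot ownership flags; each row has exactly one flag set to 1.
toy = pd.DataFrame({
    "tipovivi1": [1, 0, 0, 0],
    "tipovivi2": [0, 0, 1, 0],
    "tipovivi3": [0, 1, 0, 0],
    "tipovivi4": [0, 0, 0, 0],
    "tipovivi5": [0, 0, 0, 1],
})

# Same labels as hh_kind_lab, keyed by column name.
kind_labels = {
    "tipovivi1": "Own house(fully paid)",
    "tipovivi2": "Own house(installment pays)",
    "tipovivi3": "Rented",
    "tipovivi4": "precarious",
    "tipovivi5": "Borrowed",
}

# idxmax(axis=1) returns the name of the column holding the row maximum (the 1).
toy["house_kind"] = toy[list(kind_labels)].idxmax(axis=1).map(kind_labels)
print(toy["house_kind"].tolist())
```

The same pattern would apply to the other one-hot groups in the data, such as the `instlevel` columns.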

# <a id='7'>7. Over Crowding </a>
## <a id='7-1'>7.1. Percentage of poverty level by rooms in house hold </a>

In [None]:
rms = pd.crosstab(df["rooms"],df["Target"]).apply(lambda r:r/r.sum()*100,axis =1)
ax = rms.plot(kind = "bar",stacked =True,figsize = (12,7),
              color = sns.color_palette("Set1",5),
              linewidth = 1,edgecolor = "k")
ax.set_facecolor("k")
plt.xlabel("Total rooms in the household")
plt.ylabel("percentage")
plt.legend(loc = "best",prop = {"size" : 13})
plt.xticks(rotation = 0)
plt.grid(True,alpha = .1)
plt.title("Percentage of poverty level by rooms in house hold")
plt.show()

## <a id='7-2'>7.2.  Over Crowding by Poverty Categories </a>
* hacdor	  - 1 Overcrowding by bedrooms
* hacapo	  - 1 Overcrowding by rooms
* The extreme poverty category has the highest percentage of overcrowded households compared to all other categories.

In [None]:

#hacdor	 =1 Overcrowding by bedrooms
#hacapo	 =1 Overcrowding by rooms

df = df.rename(columns = {"hacdor":"over_crowding_bedroom" ,"hacapo":"overcrowding_rooms"} )

ov_bd = pd.crosstab(df["Target"],df["over_crowding_bedroom"]).apply(lambda r:r/r.sum()*100,axis = 1)
ax = ov_bd.plot(kind = "bar",stacked =True,figsize=(10,5),width=.3,
               linewidth = 1,edgecolor = "w")
plt.xticks(rotation = 0)
plt.legend(loc = "best",labels = ["NO","YES"])
ax.set_facecolor("k")
plt.grid(True,alpha = .2)
plt.ylabel("percentage")
plt.title("Over Crowding by bedrooms")

ov_r = pd.crosstab(df["Target"],df["overcrowding_rooms"]).apply(lambda r:r/r.sum()*100,axis = 1)
ax1  = ov_r.plot(kind = "bar",stacked =True,figsize=(10,5),width = .3,
                linewidth = 1,edgecolor = "w")
plt.xticks(rotation = 0)
plt.legend(loc = "best",labels = ["NO","YES"])
ax1.set_facecolor("k")
plt.grid(True,alpha = .2)
plt.ylabel("percentage")
plt.title("Over crowding by rooms")
plt.show()

## <a id='7-3'>7.3. Over crowding by poverty levels  </a>
* overcrowding	 - persons per room

In [None]:
df["SQBovercrowding"]
plt.figure(figsize=(13,7))
pal1 = ["#3399FF","#FF3300","#669900","#FF9966"]
for i,j,k in itertools.zip_longest(pov_lst,range(length),pal1):
    ax = sns.kdeplot(df[df["Target"] == i]["overcrowding"],
                     label = i,
                     shade=True,linewidth = 2,color = k)
    ax.set_facecolor("k")
    plt.axvline(df[df["Target"] == i]["overcrowding"].mean(),color = k ,
                label = "mean",linestyle = "dotted")
    plt.grid(True,alpha = .2)
    plt.title(" Over crowding by poverty levels")
    plt.legend(loc = "best" , prop = {"size" : 12})
    plt.xlabel("Over crowding ")

## <a id='7-4'>7.4. Summary stats on over crowding squared  by poverty levels </a>

In [None]:
ovr  = df.groupby("Target")["SQBovercrowding"].describe()[['mean', 'std']]
ax = ovr.plot(kind = "bar",figsize = (10,5),linewidth = 1,edgecolor = "w")
plt.xticks(rotation = 0)
ax.set_facecolor("k")
plt.grid(True,alpha = .1)
plt.legend(loc = "best" , prop = {"size" : 12})
plt.title("Summary stats on over crowding squared  by poverty levels")
plt.ylabel("Over crowding squared")
plt.show()

# <a id='8'>8 . AGE </a>
## <a id='8-1'>8.1. Age Distribution by Poverty levels </a>
age	 -  Age in years

In [None]:
pal = ["#FF3366","#6633CC","#66FF00","#FF9900"]
plt.figure(figsize=(12,7))

for i,j,k in itertools.zip_longest(pov_lst,range(length),pal):
    ax = sns.kdeplot(df[df["Target"] == i]["age"],
                     label = i,
                     shade=True,linewidth = 2,color = k)
    ax.set_facecolor("k")
    plt.axvline(df[df["Target"] == i]["age"].mean(),color = k ,
                label = "mean",linestyle = "dashed")
    plt.grid(True,alpha = .2)
    plt.title("Age Distribution by Poverty levels")
    plt.legend(loc = "best" , prop = {"size" : 12})
    plt.xlabel("Age in years")

## <a id='8-2'>8.2.  Age distribution by gender </a>
* The average age of males is slightly lower than the average age of females.

In [None]:
plt.figure(figsize=(13,6))
ax = sns.distplot(df[df["male"] == 1]["age"],label = "MALE",color = "b")
ax = sns.distplot(df[df["male"] == 0]["age"],label = "FEMALE",color = "r")
plt.axvline(df[df["male"] == 1]["age"].mean(),linestyle = "dotted",
            label = "MALE AVG",color = "b")
plt.axvline(df[df["male"] == 0]["age"].mean(),linestyle = "dotted",
            label = "FEMALE AVG",color = "r")
ax.set_facecolor("lightgrey")
plt.legend(loc = "best" , prop = {"size" :  12})
plt.title(" Age distribution by gender ")
plt.show()

# <a id='9'>9 . EDUCATION </a>
## <a id='9-1'>9.1. Years of schooling by poverty levels </a>
* escolari	  - years of schooling

In [None]:
#escolari	 years of schooling

pal = ["#FF3366","#6633CC","#66FF00","#FF9900"]
pal1 = ["#3399FF","#FF3300","#669900","#FF9966"]
plt.figure(figsize=(13,12))

for i,j,k in itertools.zip_longest(pov_lst,range(length),pal1):
    plt.subplot(2,2,j+1)
    ax = sns.kdeplot(df[df["Target"] == i]["escolari"],
                     label = i,
                     shade=True,linewidth = 2,color = k)
    ax.set_facecolor("k")
    plt.axvline(df[df["Target"] == i]["escolari"].mean(),color = k ,
                label = "mean",linestyle = "dashed")
    plt.grid(True,alpha = .2)
    plt.title(i + " - Years of schooling ")
    plt.legend(loc = "best" , prop = {"size" : 12})
    plt.xlabel(" years of schooling")

## <a id='9-2'>9.2. Summary stats on years of schooling by poverty levels </a>
* Non vulnerable households have the highest maximum and average years of education.
* Extreme and moderate poverty households have low average years of education.

In [None]:
esc_des  = df.groupby("Target")["escolari"].describe()[['mean', 'std',"max"]]
ax = esc_des.plot(kind = "bar",figsize = (10,5),linewidth = 1,edgecolor = "w")
plt.xticks(rotation = 0)
ax.set_facecolor("k")
plt.grid(True,alpha = .1)
plt.legend(loc = "best" , prop = {"size" : 12})
plt.title("Summary stats on years of schooling by poverty levels")
plt.yticks(np.arange(0,22,3))
plt.ylabel("years")
plt.show()
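
When only a few statistics are needed, `agg` with a list of function names returns just those columns instead of slicing the full `describe()` output. A minimal sketch on invented values (`toy` and `esc_stats` are illustrative names):

```python
import pandas as pd

# Made-up years of schooling for two poverty levels.
toy = pd.DataFrame({
    "Target":   ["extreme poverty", "extreme poverty", "non vulnerable", "non vulnerable"],
    "escolari": [2, 6, 11, 15],
})

# One row per poverty level, one column per requested statistic.
esc_stats = toy.groupby("Target")["escolari"].agg(["mean", "std", "max"])
print(esc_stats)
```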

## <a id='9-3'>9.3. Years behind schooling by poverty levels </a>
* rez_esc -  Years behind in school.
* 82% of the values in this column are missing.
* For the non vulnerable category, the percentage tends to decrease as the years behind increase, while for the extreme and moderate poverty levels the percentage increases.

In [None]:
x = df[df["rez_esc"].notnull()]
#rez_esc	 Years behind in school

rez = pd.crosstab(x["rez_esc"],x["Target"]).apply(lambda r:r/r.sum()*100,axis = 1)
ax = rez.plot(kind = "bar",figsize = (12,6),linewidth = 1,edgecolor = "w")
ax.set_facecolor("k")
plt.xticks(rotation = 0)
plt.xlabel("Years behind schooling")
plt.ylabel("percentage")
plt.grid(True,alpha = .2)
plt.title("Years behind schooling by poverty levels")
plt.show()

## <a id='9-4'>9.4. Years of education by gender </a>
* edjefe	- years of education of male head of household.
* edjefa	- years of education of female head of household.

In [None]:
#edjefe	 years of education of male head of household
#edjefa	 years of education of female head of household

ed_ls = ["edjefa","edjefe"]
ed_m = df["edjefe"].value_counts().reset_index()
ed_m.columns = ["years","count"]
ed_m["type"] = "male"
ed_f = df["edjefa"].value_counts().reset_index()
ed_f.columns = ["years","count"]
ed_f["type"] = "female"
ed_fm = pd.concat([ed_m,ed_f] ,axis = 0 ).sort_values(by = "count",ascending = False)

plt.figure(figsize= (13,7))
ax = sns.barplot(x = "years" , y = "count" ,
                 data=ed_fm , hue = "type",
                 linewidth = 1,palette = "Set1",
                 edgecolor = "w")
ax.set_facecolor("k")
plt.grid(True,alpha= .3)
plt.title("Years of education by gender")
plt.legend(prop = {"size" : 12})
plt.show()

## <a id='9-5'>9.5. average years of education for adults (18+) by poverty levels </a>
* meaneduc	- average years of education for adults (18+)

In [None]:
#meaneduc	average years of education for adults (18+)
df["meaneduc"]

plt.figure(figsize=(12,7))

for i,j,k in itertools.zip_longest(pov_lst,range(length),pal):
  #
    ax = sns.kdeplot(df[df["Target"] == i]["meaneduc"],
                     label = i,
                     shade=True,linewidth = 2,color = k)
    ax.set_facecolor("k")
    plt.axvline(df[df["Target"] == i]["meaneduc"].mean(),color = k ,
                label = "mean",linestyle = "dashed")
    plt.grid(True,alpha = .2)
    plt.title("Mean years of education by poverty levels ")
    plt.legend(loc = "best" , prop = {"size" : 12})
    plt.xlabel("Mean years of education")


## <a id='9-6'>9.6. Education Level Frequency distribution </a>
* instlevel1	   =1 no level of education
* instlevel2	 =1 incomplete primary
* instlevel3	 =1 complete primary
* instlevel4	 =1 incomplete academic secondary level
* instlevel5	 =1 complete academic secondary level
* instlevel6	 =1 incomplete technical secondary level
* instlevel7	 =1 complete technical secondary level
* instlevel8	 =1 undergraduate and higher education
* instlevel9	 =1 postgraduate higher education

In [None]:
# instlevel1	 =1 no level of education
# instlevel2	 =1 incomplete primary
# instlevel3	 =1 complete primary
# instlevel4	 =1 incomplete academic secondary level
# instlevel5	 =1 complete academic secondary level
# instlevel6	 =1 incomplete technical secondary level
# instlevel7	 =1 complete technical secondary level
# instlevel8	 =1 undergraduate and higher education
# instlevel9	 =1 postgraduate higher education

inst_cols = df.columns[df.columns.str.contains("inst")]


def label_edu(df):
    if df["instlevel1"] == 1 :
        return "No Education"
    elif df["instlevel2"] == 1:
        return "Incomplete Primary"
    elif df["instlevel3"]  == 1 :
        return "Complete Primary"
    elif df["instlevel4"]  == 1 :
        return "Incomplete academic secondary level"
    elif df["instlevel5"]  == 1 :
        return "Complete academic secondary level"
    elif df["instlevel6"]  == 1 :
        return "Incomplete technical secondary level"
    elif df["instlevel7"]  == 1 :
        return "Complete technical secondary level"
    elif df["instlevel8"]  == 1 :
        return "undergraduate and higher education"
    elif df["instlevel9"]  == 1 :
        return "postgraduate higher education"

    
df["education_level"] = df.apply(lambda df:label_edu(df) ,axis =1)
plt.figure(figsize=( 10 ,10 ))
df["education_level"].value_counts().plot.pie(shadow = True, autopct = "%1.0f%%",
                                              wedgeprops = {"linewidth":1,"edgecolor":"white"},
                                              colors = sns.color_palette("husl"))
plt.title("Education levels")
plt.ylabel("")
plt.show()

## <a id='9-7'>9.7. Education levels by gender </a>

In [None]:
ax = pd.crosstab(df["education_level"],df["male"]).plot(kind = "barh",
                                                        figsize = (10,8),
                                                        linewidth = 1,
                                                        edgecolor = "w")
ax.set_facecolor("k")
plt.xlabel("count")
plt.grid(True,alpha = .2)
plt.title("Education levels by gender")
# crosstab columns are ordered male = 0, male = 1, so "female" comes first
plt.legend(labels = ["female","male"],prop = {"size" :13})
plt.ylabel("")
plt.show()

## <a id='9-8'>9.8. Education levels by poverty category </a>

In [None]:
ed_pv = pd.crosstab(df["education_level"],df["Target"]).apply(lambda r:r/r.sum()*100,axis = 1)

ax = ed_pv.plot(kind = "barh",stacked =True ,figsize = (10,8),
                color = sns.color_palette("Set1"),
                linewidth = 1,edgecolor = "w")
ax.set_facecolor("k")
plt.xlabel("percentage")
plt.grid(True,alpha = .2)
plt.title("Education levels by poverty category")
plt.legend(prop = {"size" :13})
plt.ylabel("")
plt.show()
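
The row percentages computed above with `apply` can also be obtained through the `normalize` argument of `pd.crosstab`. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration only):

```python
import pandas as pd

toy = pd.DataFrame({
    "education_level": ["Primary", "Primary", "Primary", "Secondary", "Secondary"],
    "Target": ["extreme", "moderate", "moderate", "moderate", "non vulnerable"],
})

# Manual normalisation, as in the cell above: each row divided by its own total.
manual = pd.crosstab(toy["education_level"], toy["Target"]).apply(
    lambda r: r / r.sum() * 100, axis=1
)

# Built-in equivalent: normalize="index" scales every row to sum to 1.
builtin = pd.crosstab(toy["education_level"], toy["Target"], normalize="index") * 100

print(builtin.round(1))
```

Both tables hold the share of each poverty category within an education level, so every row sums to 100.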


# <a id='10'>10.  WALLS </a>

## <a id='10-1'>10.1. Predominant material on the outside wall </a>

In [None]:
# paredblolad	 =1 if predominant material on the outside wall is block or brick		
# paredzocalo	 =1 if predominant material on the outside wall is socket (wood, zinc or asbestos)
# paredpreb	 =1 if predominant material on the outside wall is prefabricated or cement		
# pareddes	 =1 if predominant material on the outside wall is waste material		
# paredmad	 =1 if predominant material on the outside wall is wood		
# paredzinc	 =1 if predominant material on the outside wall is zinc
# paredfibras	 =1 if predominant material on the outside wall is natural fibers		
# paredother	 =1 if predominant material on the outside wall is other		
# epared1	 =1 if walls are bad
# epared2	 =1 if walls are regular
# epared3	 =1 if walls are good


wal_col = df.columns[df.columns.str.contains("pared")]

def wall_label(df):
    if df["paredblolad"] == 1 :
        return "Brick"
    if df["paredzocalo"] == 1 :
        return "socket (wood, zinc or asbestos)"
    if df["paredpreb"] == 1 :
        return "cement"
    if df["pareddes"] == 1:
        return "waste material"
    if df["paredmad"] == 1:
        return "wood"
    if df["paredzinc"] == 1:
        return "zinc"
    if df["paredfibras"] == 1:
        return "natural fiber"
    if df["paredother"] == 1:
        return "other"
    
def wall_condition_label(df) :
    if df["epared1"] == 1:
        return "Bad"
    if df["epared2"] == 1:
        return "Regular"
    if df["epared3"] == 1:
        return "Good"

df["wall_material"]  = df.apply(lambda df:wall_label(df),axis =1)
df["wall_condition"] = df.apply(lambda df:wall_condition_label(df),axis =1)

plt.figure(figsize= (10,8))
ax  = sns.countplot(y = df["wall_material"],order = df["wall_material"].value_counts().index,
                    linewidth = 1,edgecolor = "w"*df["wall_material"].nunique(),
                    palette = "gist_ncar")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)
plt.title("Predominant material on the outside wall")
plt.show()


## <a id='10-2'>10.2. Wall condition by predominant material on the outside wall </a>

In [None]:
plt.figure(figsize= (12,6))
ax  = sns.countplot(x = df["wall_material"],order = df["wall_material"].value_counts().index,
                    hue = df["wall_condition"],
                    linewidth = 1,edgecolor = "w"*df["wall_material"].nunique(),
                    palette = "jet_r")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)
plt.title("Wall condition by predominant material on the outside wall")
plt.show()

## <a id='10-3'>10.3. Poverty  by wall condition </a>

In [None]:

plt.figure(figsize=(10,10))
for i,j in itertools.zip_longest(pov_lst,range(len(pov_lst))):
    plt.subplot(2,2,j+1)
    df[df["Target"] == i]["wall_condition"].value_counts().plot.pie(startangle = 120,
                                                                    shadow = True, autopct = "%1.0f%%",
                                                                    wedgeprops = {"linewidth":1,"edgecolor":"white"},
                                                                    colors = pal1)
    plt.ylabel("")
    plt.title(i,color ="b")

# <a id='11'>11 . FLOORING </a>

## <a id='11-1'>11.1 Predominant material on the floor </a>

In [None]:
# eviv1	 =1 if floor is bad
# eviv2	 =1 if floor is regular
# eviv3	 =1 if floor is good
# pisomoscer	 =1 if predominant material on the floor is mosaic
# pisocemento	 =1 if predominant material on the floor is cement
# pisoother	 =1 if predominant material on the floor is other
# pisonatur	 =1 if predominant material on the floor is  natural material
# pisonotiene	 =1 if no floor at the household
# pisomadera	 =1 if predominant material on the floor is wood

ls = df.columns[df.columns.str.contains("piso")][:6].tolist()
cn = df.columns[df.columns.str.contains("eviv")].tolist()
floor_cols = ls+cn

x = df[floor_cols]

def floor_lab(df):
    if df["pisomoscer"] == 1 :
        return "Mosaic"
    elif df["pisocemento"] == 1 :
        return "Cement"
    elif df["pisoother"] == 1 :
        return "Other"
    elif df["pisonatur"] == 1 :
        return "Natural material"
    elif df["pisonotiene"] == 1 :
        return "No floor"
    elif df["pisomadera"] == 1 :
        return "Wood"


def floor_condition_lab(df) :
    if df["eviv1"] == 1 :
        return "Bad"
    elif df["eviv2"] == 1:
        return "Regular"
    elif df["eviv3"] == 1 :
        return "Good"
    
df["floor_material"]  = df.apply(lambda df:floor_lab(df),axis = 1)
df["floor_condition"] = df.apply(lambda df:floor_condition_lab(df) ,axis =1)

df["floor_material"]
plt.figure(figsize= (8,6))
ax  = sns.countplot(y = df["floor_material"],order = df["floor_material"].value_counts().index,
                    linewidth = 1,edgecolor = "w"*df["floor_material"].nunique(),
                    palette = "gist_ncar")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)
plt.title("Predominant material on the floor")
plt.show()

## <a id='11-2'>11.2. Floor condition by predominant material on the floor </a>

In [None]:
plt.figure(figsize= (12,6))
ax  = sns.countplot(x = df["floor_material"],order = df["floor_material"].value_counts().index,
                    hue = df["floor_condition"],
                    linewidth = 1,edgecolor = "w"*df["floor_material"].nunique(),
                    palette = "jet_r")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)
plt.title("Floor condition by predominant material on the floor")
plt.show()

## <a id='11-3'>11.3. poverty level by floor condition </a>

In [None]:

plt.figure(figsize=(10,10))
for i,j in itertools.zip_longest(pov_lst,range(len(pov_lst))):
    plt.subplot(2,2,j+1)
    df[df["Target"] == i]["floor_condition"].value_counts().plot.pie(startangle = 120,
                                                                    shadow = True, autopct = "%1.0f%%",
                                                                    wedgeprops = {"linewidth":1,"edgecolor":"white"},
                                                                    colors = pal1)
    plt.ylabel("")
    plt.title(i,color ="b")

# <a id='12'>12.CEILING </a>
## <a id='12-1'>12.1. Predominant material on the ceiling </a>

In [None]:
# techozinc	 =1 if predominant material on the roof is metal foil or zinc
# techoentrepiso	 =1 if predominant material on the roof is fiber cement
# techocane	 =1 if predominant material on the roof is natural fibers
# techootro	 =1 if predominant material on the roof is other
# cielorazo	 =1 if the house has ceiling
# etecho1	 =1 if roof is bad
# etecho2	 =1 if roof is regular
# etecho3	 =1 if roof is good


ceil_col = df.columns[df.columns.str.contains("techo")]

def ciel_lab(df) :
    if df["techozinc"] == 1 :
        return "Foil or Zinc"
    elif df["techoentrepiso"] == 1:
        return "Fiber cement"
    elif df["techocane"] == 1 :
        return "Natural Fibers"
    elif df["techootro"] == 1:
        return "Other"
def ciel_condition(df):
    if df["etecho1"] == 1:
        return "Bad"
    if df["etecho2"] == 1 :
        return "Regular"
    if df["etecho3"] == 1 :
        return "Good"

df["cieling_material"]  = df.apply(lambda df:ciel_lab(df),axis = 1)
df["cieling_condition"] = df.apply(lambda df:ciel_condition(df),axis=1)
df[['techozinc', 'techoentrepiso', 'techocane', 'techootro', 'etecho1',
       'etecho2', 'etecho3',"cieling_material","cieling_condition"]]
plt.figure(figsize= (6,6))
ax  = sns.countplot(y = df["cieling_material"],order = df["cieling_material"].value_counts().index,
                    linewidth = 1,edgecolor = "w"*df["cieling_material"].nunique(),
                    palette = "Set1")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)
plt.title("Predominant material on the ceiling")
plt.show()

## <a id='12-2'>12.2. Ceiling condition by predominant material on the roof </a>


In [None]:
plt.figure(figsize= (12,6))
ax  = sns.countplot(x = df["cieling_material"],order = df["cieling_material"].value_counts().index,
                    hue = df["cieling_condition"],
                    linewidth = 1,edgecolor = "w"*df["cieling_material"].nunique(),
                    palette = "jet_r")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)
plt.title("Ceiling condition by predominant material on the roof")
plt.show()

## <a id='12-3'>12.3.poverty level by ceiling condition </a>

In [None]:
plt.figure(figsize=(10,10))
for i,j in itertools.zip_longest(pov_lst,range(len(pov_lst))):
    plt.subplot(2,2,j+1)
    df[df["Target"] == i]["cieling_condition"].value_counts().plot.pie(startangle = 120,
                                                                    shadow = True, autopct = "%1.0f%%",
                                                                    wedgeprops = {"linewidth":1,"edgecolor":"white"},
                                                                    colors = pal1)
    plt.ylabel("")
    plt.title(i,color ="b")

# <a id='13'>13.Electricity & water provision </a>
## <a id='13-1'>13.1.Electricity Source in the dwelling </a>

In [None]:
# public	 =1 electricity from CNFL, ICE, ESPH/JASEC
# planpri	 =1 electricity from private plant		
# noelec	 =1 no electricity in the dwelling		
# coopele	 =1 electricity from cooperative		

def elec_lab(df) :
    if df["public"] == 1 :
        return "Public (CNFL, ICE, ESPH/JASEC)"
    if df["planpri"]== 1 :
        return "Private Plant"
    if df["noelec"] == 1 :
        return "No electricity"
    if df["coopele"] == 1 :
        return "Cooperative"
df["electricity_source"] = df.apply(lambda df:elec_lab(df) ,axis = 1)
df[["public","planpri","noelec","coopele","electricity_source"]]
plt.figure(figsize= (8,6))
ax  = sns.countplot(y = df["electricity_source"],order = df["electricity_source"].value_counts().index,
                    linewidth = 1,edgecolor = "w"*df["electricity_source"].nunique(),
                    palette = "gist_ncar")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)
plt.title("Electricity Source in the dwelling")
plt.show()

## <a id='13-2'>13.2.Water Source type in the dwelling </a>

In [None]:
#abastaguadentro	 =1 if water provision inside the dwelling
#abastaguafuera	 =1 if water provision outside the dwelling
#abastaguano	 =1 if no water provision

def water_lab(df):
    if df["abastaguadentro"] == 1 :
        return "Inside Dwelling"
    elif df["abastaguafuera"] == 1 :
        return "Outside Dwelling"
    elif df["abastaguano"] == 1 :
        return "No water provision"
df["water_type"] = df.apply(lambda df:water_lab(df),axis=1)

plt.figure(figsize= (8,8))
df["water_type"].value_counts().plot.pie(shadow = True, autopct = "%1.1f%%",
                                         wedgeprops = {"linewidth":1,"edgecolor":"white"},
                                         colors = pal1)
plt.ylabel("")
plt.title("Water Source type in the dwelling")
plt.show()

## <a id='13-3'>13.3.Energy Source for cooking in the dwelling </a>

In [None]:
# energcocinar1	 =1 no main source of energy used for cooking (no kitchen)
# energcocinar2	 =1 main source of energy used for cooking electricity
# energcocinar3	 =1 main source of energy used for cooking gas
# energcocinar4	 =1 main source of energy used for cooking wood charcoal

df.columns[df.columns.str.contains("energcocinar")]

def energy_lab(df):
    if df["energcocinar1"] == 1 :
        return "No source"
    elif df["energcocinar2"] == 1 :
        return "Electricity"
    elif df["energcocinar3"] == 1 :
        return "Cooking Gas"
    elif df["energcocinar4"] == 1 :
        return "Wood Charcoal"
df["energy_source"] = df.apply(lambda df:energy_lab(df),axis = 1)
plt.figure(figsize=(8,5))
ax  = sns.countplot(y = df["energy_source"],order = df["energy_source"].value_counts().index,
                    linewidth = 1,edgecolor = "w"*df["energy_source"].nunique(),
                    palette = "gist_ncar")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)

plt.title("Energy Source for cooking in the dwelling")
plt.show()

## <a id='13-4'>13.4.Percentage of poverty level by energy source </a>

In [None]:
en_s = pd.crosstab(df["Target"],df["energy_source"]).apply(lambda x:x/x.sum()*100,axis = 1)
ax =  en_s.plot(kind = "bar",figsize = (10,5),linewidth=1,edgecolor="w"*df["Target"].nunique()) 
plt.xticks(rotation = 0)
plt.ylabel("percentage")
ax.set_facecolor("k")
plt.title("Percentage of poverty level by energy source")
plt.show()

# <a id='14'>14.Sanitation </a>
## <a id='14-1'>14.1.Sanitation system type in the dwelling </a>

In [None]:
# elimbasu1	 =1 if rubbish disposal mainly by tanker truck
# elimbasu2	 =1 if rubbish disposal mainly by botan hollow or buried
# elimbasu3	 =1 if rubbish disposal mainly by burning
# elimbasu4	 =1 if rubbish disposal mainly by throwing in an unoccupied space
# elimbasu5	 =1 if rubbish disposal mainly by throwing in river
# elimbasu6	 =1 if rubbish disposal mainly other

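# sanitario1	 =1 no toilet in the dwelling
# sanitario2	 =1 toilet connected to sewer or cesspool
# sanitario3	 =1 toilet connected to septic tank
# sanitario5	 =1 toilet connected to black hole or latrine
# sanitario6	 =1 toilet connected to other system
def san_lab(df):
    if df["sanitario1"] == 1 :
        return "No toilet"
    elif df["sanitario2"] == 1 :
        return "Sewer or cesspool"
    elif df["sanitario3"] == 1 :
        return "Septic tank"
    elif df["sanitario5"] == 1 :
        return "Black hole or latrine"
    elif df["sanitario6"] == 1 :
        return "Other system"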
def elim_lab(df):
    if df["elimbasu1"] == 1 :
        return "Tanker truck"
    if df["elimbasu2"] == 1 :
        return "Buried"
    if df["elimbasu3"] == 1 :
        return "Burning"
    if df["elimbasu4"] == 1 :
        return "Unoccupied Space"
    if df["elimbasu5"] == 1 :
        return "River"
    if df["elimbasu6"] == 1 :
        return "Other"
    
df["sanitation_type"] = df.apply(lambda df:san_lab(df),axis = 1)
df["elimination_system"] = df.apply(lambda df:elim_lab(df),axis = 1)

df["sanitation_type"].value_counts()
plt.figure(figsize=(8,5))
ax  = sns.countplot(y = df["sanitation_type"],order = df["sanitation_type"].value_counts().index,
                    linewidth = 1,edgecolor = "w"*df["sanitation_type"].nunique(),
                    palette = "gist_ncar")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)

plt.title("Sanitation system type in the dwelling")
plt.show()

## <a id='14-2'>14.2.Elimination type in the dwelling </a>

In [None]:
plt.figure(figsize=(8,5))
ax  = sns.countplot(y = df["elimination_system"],order = df["elimination_system"].value_counts().index,
                    linewidth = 1,edgecolor = "w"*df["elimination_system"].nunique(),
                    palette = "gist_ncar")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)

plt.title("Elimination type in the dwelling")
plt.show()

# <a id='15'>15.Civil status and relationships </a>
## <a id='15-1'>15.1.Civil status category by the householders </a>

In [None]:
# estadocivil1	 =1 if less than 10 years old
# estadocivil2	 =1 if free or coupled union
# estadocivil3	 =1 if married
# estadocivil4	 =1 if divorced
# estadocivil5	 =1 if separated
# estadocivil6	 =1 if widow/er
# estadocivil7	 =1 if single

civ_cols = df.columns[df.columns.str.contains("estadocivil")]

def civil_lab(df):
    if df["estadocivil1"] == 1 :
        return "less than 10 years"
    elif df["estadocivil2"] == 1 :
        return "Free or Coupled union"
    elif df["estadocivil3"] == 1 :
        return "Married"
    elif df["estadocivil4"] == 1 :
        return "Divorced"
    elif df["estadocivil5"] == 1 :
        return "Separated"
    elif df["estadocivil6"] == 1 :
        return "Widow"
    elif df["estadocivil7"] == 1 :
        return "Single"
    
df["civil_status"] = df.apply(lambda df:civil_lab(df),axis = 1)

plt.figure(figsize=(8,8))
ax  = sns.countplot(y = df["civil_status"],order = df["civil_status"].value_counts().index,
                    linewidth = 1,edgecolor = "w"*df["civil_status"].nunique(),
                    palette = "gist_ncar")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)

plt.title("Civil status category by the householders")
plt.show()

## <a id='15-2'>15.2.Relation status category by the householders </a>

In [None]:
# parentesco1	 =1 if household head
# parentesco2	 =1 if spouse/partner
# parentesco3	 =1 if son/daughter
# parentesco4	 =1 if stepson/daughter
# parentesco5	 =1 if son/daughter in law
# parentesco6	 =1 if grandson/daughter
# parentesco7	 =1 if mother/father
# parentesco8	 =1 if father/mother in law
# parentesco9	 =1 if brother/sister
# parentesco10	 =1 if brother/sister in law
# parentesco11	 =1 if other family member
# parentesco12	 =1 if other non family member

def relation_lab(df):
    if df["parentesco1"] == 1 :
        return "House head"
    elif df["parentesco2"] == 1 :
        return "Spouse/partner"
    elif df["parentesco3"] == 1 :
        return "Son/daughter"
    elif df["parentesco4"] == 1 :
        return "Stepson/daughter"
    elif df["parentesco5"] == 1 :
        return "Son/daughter in law"
    elif df["parentesco6"] == 1 :
        return "Grandson/daughter"
    elif df["parentesco7"] == 1 :
        return "Mother/Father"
    elif df["parentesco8"] == 1 :
        return "Mother/Father in law"
    elif df["parentesco9"] == 1 :
        return "Brother/Sister"
    elif df["parentesco10"] == 1 :
        return "Brother/Sister in law"
    elif df["parentesco11"] == 1 :
        return "other family member"
    elif df["parentesco12"] == 1 :
        return "other non family member"

df["relation_status"] = df.apply(lambda df:relation_lab(df),axis = 1) 
df["relation_status"]
plt.figure(figsize=(8,8))
ax  = sns.countplot(y = df["relation_status"],
                    order = df["relation_status"].value_counts().index,
                    linewidth = 1,edgecolor = "w"*df["relation_status"].nunique(),
                    palette = "gist_ncar")
ax.set_facecolor("k")
plt.grid(True,alpha = .2)

plt.title("Relation status category by the householders")
plt.show()

## Utilities by poverty levels

In [None]:
#refrig	 =1 if the household has refrigerator
#computer	 =1 if the household has notebook or desktop computer
#television	 =1 if the household has TV
#mobilephone	 =1 if mobile phone
#qmobilephone	 # of mobile phones

util_cols = ["refrig","computer","television","mobilephone"]
plt.figure(figsize=(13,12))
for i,j in itertools.zip_longest(util_cols,range(len(util_cols))):
    ax  = plt.subplot(2,2,j+1)
    ct  = pd.crosstab(df["Target"],df[i])
    ax1 = ct.plot(kind = "bar",ax=ax,linewidth = 1,
                  edgecolor = "w"*df["Target"].nunique())
    ax1.set_facecolor("k")
    plt.xticks(rotation = 0,fontsize = 9)
    plt.title(i)
    plt.xlabel("")
    plt.legend(prop = {"size" : 12})

# <a id='16'>16.Dependency(squared) </a>
## <a id='16-1'>16.1.Average squared dependency for poverty levels </a>
* Dependency is calculated as (number of household members younger than 19 or older than 64) / (number of household members between 19 and 64); `SQBdependency` is the square of this ratio.

In [None]:
# dependency = (number of household members younger than 19 or older than 64)/(number of household members between 19 and 64)
ax = df.groupby("Target")["SQBdependency"].mean().plot(kind = "bar",linewidth = 1,figsize = (10,5),
                                                  edgecolor = "w"*df["Target"].nunique())
ax.set_facecolor("k")
plt.xticks(rotation = 0)
plt.title("Average squared dependency for poverty levels")
plt.grid(True,alpha = .2)
plt.show()
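
As a worked example of the dependency ratio defined above, the sketch below computes it for a single hypothetical household. The `dependency` helper, the sample ages, and the choice to return NaN when there are no working-age members are assumptions made for illustration; the dataset may encode that case differently.

```python
def dependency(ages):
    """Dependency ratio for one household, following the definition above."""
    dependents = sum(1 for a in ages if a < 19 or a > 64)
    working_age = sum(1 for a in ages if 19 <= a <= 64)
    if working_age == 0:
        # Assumption: the ratio is treated as undefined without working-age members.
        return float("nan")
    return dependents / working_age

ages = [4, 10, 35, 40, 70]   # hypothetical household
ratio = dependency(ages)     # 3 dependents / 2 working-age members
squared = ratio ** 2         # squaring gives the kind of value plotted above
print(ratio, squared)
```

Squaring stretches the scale, so households with many dependents per working-age member weigh heavily in the group means plotted above.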

## <a id='16-2'>16.2.Average squared dependency by education levels  </a>

In [None]:
ax = df.groupby("education_level")["SQBdependency"].mean().plot(kind = "barh",linewidth = 1,figsize = (8,8),
                                                                edgecolor = "w",
                                                                color = sns.color_palette("prism",9)
                                                               )
ax.set_facecolor("k")
plt.xticks(rotation = 0)
plt.title("Average squared dependency by education levels")
plt.grid(True,alpha = .2)
plt.show()

# <a id='17'>17.Zones  </a>
## <a id='17-1'>17.1.  zone percentage in training data  </a>

In [None]:
#area1	 =1 urban zone
#area2	 =1 rural zone

def area_lab(df) :
    if df["area1"] == 1 :
        return "URBAN"
    elif df["area2"] == 1 :
        return "RURAL"
df["zone_type"] = df.apply(lambda df:area_lab(df),axis = 1 )
df[["area1","area2","zone_type"]]
plt.figure(figsize= (13,6))
plt.subplot(121)
df["zone_type"].value_counts().plot.pie(autopct = "%1.0f%%",
                                        wedgeprops = {"linewidth":2,"edgecolor":"white"},
                                        colors = pal1)
plt.title("zone percentage in training data")
plt.ylabel("")

plt.subplot(122)
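# use the value_counts() order so the text labels below match the bars
ax = sns.countplot(y = df["zone_type"],linewidth = 2 ,
                   order = df["zone_type"].value_counts().index,
                   edgecolor = "k",palette = pal1)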

for i,j in enumerate(df["zone_type"].value_counts()):
    ax.text(.7,i,j,fontsize = 14)
plt.title("Zone count in training data")
plt.show()

## <a id='17-2'>17.2.  urban and rural percentage by poverty levels  </a>

In [None]:
pov_lst
pal2 = ["#3399FF","#FF3300","#669900","#FF9966","#FF9900","#999933"]
plt.figure(figsize=(10,10))
for i,j in itertools.zip_longest(pov_lst,range(len(pov_lst))):
    plt.subplot(2,2,j+1)
    df[df["Target"] == i]["zone_type"].value_counts().plot.pie(shadow = True,
                                                               autopct = "%1.1f%%",
                                                               wedgeprops = {"linewidth":2,"edgecolor":"k"},
                                                               colors = pal2)
    plt.title(i,color = "b")
    plt.ylabel("")

# <a id='18'>18 . Socio-economic regions of Costa Rica </a>
* Costa Rica is also divided into six regions, although the provinces are the primary units of local government. The regions seem to be used for certain statistical reports, such as public health reports. Some region names derive from pre-Columbian ethnic groups.

In [None]:
img = np.array(Image.open(r"../input/regions-img/resumen-estudios-sociales-de-regiones-de-costa-rica-1-638.jpg"))
fig = plt.figure(figsize=(12,12))
plt.imshow(img,interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
#lugar1	 =1 region Central
#lugar2	 =1 region Chorotega
#lugar3	 =1 region Pacífico central
#lugar4	 =1 region Brunca
#lugar5	 =1 region Huetar Atlántica
#lugar6	 =1 region Huetar Norte

pal2 = ["#3399FF","#FF3300","#669900","#FF9966","#FF9900","#999933"]

reg = df[df.columns[df.columns.str.contains("lugar")]].stack().reset_index()[["level_1",0]]
reg = reg[reg[0] == 1]
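# name both columns explicitly; the defaults of value_counts().reset_index() differ across pandas versions
reg = reg["level_1"].value_counts().rename_axis("index").reset_index(name = "level_1")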
reg["index"] = reg["index"].map({"lugar1":"Central","lugar2":"Chorotega",
                                 "lugar3":"Pacifico Central","lugar4":"Brunca",
                                "lugar5" : "Huetar Atlantica" , "lugar6" :"Huetar Norte"})

plt.figure(figsize=(7,16))
plt.subplot(211)
plt.pie(reg["level_1"],labels=reg["index"],autopct = "%1.0f%%",
        wedgeprops = {"linewidth":2,"edgecolor":"white"},
        colors = pal2)
plt.title("Region percentage in training data")

plt.subplot(212)
ax = sns.barplot(y = reg["index"], x = reg["level_1"],
                linewidth = 1,edgecolor ="k"*len(reg),
                palette = pal2)
plt.xlabel("count")
plt.ylabel("region")
plt.grid(True,alpha = .1)
for i,j in enumerate(reg["level_1"].values):
    ax.text(.7,i,j,fontsize = 12)
plt.title("Region count in training data")
plt.show()

In [None]:
#Extracting region from region variables
def label(df):
    if df["lugar1"] == 1 :
        return "CENTRAL"
    elif df["lugar2"] == 1 :
        return "CHOROTEGA"
    elif df["lugar3"] == 1 :
        return "PACIFICO CENTRAL"
    elif df["lugar4"] == 1:
        return "BRUNCA"
    elif df["lugar5"] == 1:
        return "HUETAR ATLANTICA"
    elif df["lugar6"] == 1:
        return "HUETAR NORTE"
df["region"] = df.apply(lambda df:label(df),axis =1)

##  <a id='18-1'>18.1. Poverty level percentage by Region </a>

In [None]:
reg_pov = df.groupby("region")["Target"].value_counts().to_frame()
reg_pov.columns = ["count"]
reg_pov = reg_pov.reset_index()

reg_lst = df["region"].unique()
length1 = len(reg_lst)
lab = ['non vulnerable', 'moderate', 'vulnerable','extreme']

plt.figure(figsize=(12,19))
for i,j in itertools.zip_longest(reg_lst,range(length1)):
    plt.subplot(3,2,j+1)
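    sub = reg_pov[reg_pov["region"] == i]
    # labels come from the data: value_counts() orders categories by count,
    # and that order can differ from one region to the next
    plt.pie(sub["count"],
            shadow = True,
            labels = sub["Target"],
            autopct = "%1.0f%%",
            wedgeprops = {"linewidth":2,"edgecolor":"white"},
            colors = pal2)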
    plt.title(i,color = "b")
    circ = plt.Circle((0,0),.7,color = "white")
    plt.gca().add_artist(circ)
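
`value_counts()` sorts categories by frequency within each region, so a fixed label list only lines up with the counts when every region has the same ranking. One way to get a fixed category order, sketched here on made-up data (the `toy` frame and the `order` list are illustrative assumptions), is to build a region-by-category table and reindex its columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "Target": ["extreme", "moderate", "moderate", "extreme", "extreme", "moderate"],
})

order = ["moderate", "extreme"]  # desired wedge order, chosen for illustration

# One row per region, one column per category, always in the same order.
table = pd.crosstab(toy["region"], toy["Target"]).reindex(columns=order, fill_value=0)

for region, counts in table.iterrows():
    # counts.index supplies labels that always match the values.
    print(region, dict(counts))
```

Each row of `table` could then be passed to `plt.pie` with `labels = counts.index`, which also keeps wedge colors consistent across regions.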

##  <a id='18-2'>18.2. Average monthly rent payment by regions </a>
* The Central and Chorotega regions have the highest average monthly rent.

In [None]:
pal2 = ["#3399FF","#FF3300","#669900","#FF9966","#FF9900","#999933"]
reg_lst = df["region"].unique()
month_rent_val = df[df["v2a1"].notnull()]
reg_py = month_rent_val.groupby("region")["v2a1"].mean()
ax = reg_py.plot(kind = "bar",figsize = (12,6),color = pal2,
                 linewidth = 1,edgecolor = "w")
plt.xticks(rotation = 0)
ax.set_facecolor("k")
plt.grid(True,alpha = .1)
plt.title("Average monthly rent payment by regions")
plt.ylabel("Mean monthly rent payment")
plt.show()

##   <a id='18-3'>18.3. overcrowding  by regions </a>
* On average, overcrowding is highest in the PACIFICO CENTRAL and HUETAR ATLANTICA regions.

In [None]:
reg_lst = df["region"].unique()

ocr = df.groupby("region")["SQBovercrowding"].mean()
ax = ocr.plot(kind = "bar",figsize = (12,6),color = pal2,
                 linewidth = 1,edgecolor = "w")
plt.xticks(rotation = 0)
ax.set_facecolor("k")
plt.grid(True,alpha = .1)
plt.title("Squared overcrowding  by regions")
plt.ylabel("Squared overcrowding ")
plt.show()

## <a id='18-4'>18.4.  Average Squared dependency  by regions  </a>

In [None]:
reg_dep = df.groupby("region")["SQBdependency"].mean()
ax = reg_dep.plot(kind = "bar",figsize = (12,6),color = pal2,
                 linewidth = 1,edgecolor = "w")
plt.xticks(rotation = 0)
ax.set_facecolor("k")
plt.grid(True,alpha = .1)
plt.title("Squared dependency  by regions")
plt.ylabel("Squared dependency")
plt.show()

## <a id='18-5'>18.5. Rural and urban percentage by regions  </a>

In [None]:
rg_zn = pd.crosstab(df["region"],df["zone_type"]).apply(lambda r:r/r.sum()*100,axis =1)
ax = rg_zn.plot(kind = "bar",figsize = (12,7),linewidth=1,
                edgecolor="w",color= pal1) 
plt.xticks(rotation = 0)
plt.ylabel("percentage")
ax.set_facecolor("k")
plt.grid(True ,alpha = .1)
plt.title("Rural and urban percentage by regions")
plt.show()

In [None]:
#concat train and test data
data = pd.concat([train,test],axis = 0)
data

data.isnull().sum()[data.isnull().sum() > 0]

#drop columns with more than 30% missing values, plus identifier and object-dtype columns (the squared SQB* versions are used instead)
miss_cols =  ['rez_esc', 'v18q1', 'v2a1','Id', 'dependency', 'edjefa', 'edjefe', 'idhogar']
data = data.drop(miss_cols , axis = 1 )


#extracting education level
def label_edu(data):
    if data["instlevel1"] == 1 :
        return "No Education"
    elif data["instlevel2"] == 1:
        return "Incomplete Primary"
    elif data["instlevel3"]  == 1 :
        return "Complete Primary"
    elif data["instlevel4"]  == 1 :
        return "Incomplete academic secondary level"
    elif data["instlevel5"]  == 1 :
        return "Complete academic secondary level"
    elif data["instlevel6"]  == 1 :
        return "Incomplete technical secondary level"
    elif data["instlevel7"]  == 1 :
        return "Complete technical secondary level"
    elif data["instlevel8"]  == 1 :
        return "undergraduate and higher education"
    elif data["instlevel9"]  == 1 :
        return "postgraduate higher education"

data["edu_level"] = data.apply(lambda data:label_edu(data) ,axis = 1)

#Filling missing values with grouped value of education level
ml = ["meaneduc","SQBmeaned"]

for i in ml :
    data[i] = data[i].fillna(data.groupby('edu_level')[i].transform("mean"))

#dropping the helper column
data = data.drop("edu_level",axis = 1)

#Splitting data

train_new = data[data["Target"].notnull()]
test_new  = data[data["Target"].isnull()]
    
#exclude "Target" by exact name; `i not in "Target"` is a substring test on the column name
train_X = train_new.drop("Target",axis = 1)
train_Y = train_new["Target"]
    
test_X  = test_new.drop("Target",axis = 1)
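
The group-mean imputation used in the cell above can be seen in isolation on a small frame. This is a minimal sketch with made-up values (the `toy` frame is an assumption for illustration); it mirrors the `fillna` plus `groupby(...).transform("mean")` pattern applied to `meaneduc` and `SQBmeaned`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "edu_level": ["Primary", "Primary", "Primary", "Secondary", "Secondary"],
    "meaneduc": [4.0, 6.0, np.nan, 10.0, np.nan],
})

# transform("mean") returns one value per row: the mean of that row's group,
# computed over the non-missing entries only.
group_mean = toy.groupby("edu_level")["meaneduc"].transform("mean")

# fillna aligns on the index, so each gap receives its own group's mean.
toy["meaneduc"] = toy["meaneduc"].fillna(group_mean)
print(toy)
```

Observed values are left untouched; only the gaps are filled, each with the mean of its own education group.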

In [None]:
#algorithm = classification algorithm (estimator) to fit
#columns   = feature columns to use
#kind      = "features" for tree and ensemble models, "coefficients" for logit

def model(algorithm,columns,kind) : 
    
    algorithm.fit(train_X[columns],train_Y)
    predict  = algorithm.predict(test_X[columns])
    predict  = pd.DataFrame(predict)
    predict  = predict.rename(columns = {0 : "Target"})
    identity = test[["Id"]]
    predictions = pd.merge(identity,predict,left_index = True,right_index= True ,how = "left")
    predictions["Target"] = predictions["Target"].astype(int)
    
    print ("Algorithm :  " ,algorithm)
    
    features = pd.DataFrame(columns)
    if   kind == "features" :
        importances = pd.DataFrame(algorithm.feature_importances_)
    elif kind == "coefficients" :
        importances = pd.DataFrame(algorithm.coef_)
    feature_importance = pd.merge(features,importances,left_index= True,right_index= True,how = "left")
    feature_importance = feature_importance.rename(columns = {"0_x" : "features" ,"0_y" : "Coefficients"})
    feature_importance = feature_importance.sort_values("Coefficients",ascending=False)

    trace = go.Bar(x = feature_importance["features"],
                   y = feature_importance["Coefficients"],
                   marker = dict(color = feature_importance["Coefficients"],
                                 line  = dict(width = .3,color = "black"),
                                 colorscale = "Picnic",
                                ),
                  )
    layout = go.Layout(dict(title = "Feature Importances for Model",
                            yaxis = dict(title = "Coefficients",
                                         showgrid = True,
                                        ),
                            xaxis = dict(showgrid = True),
                           ),
                      )
    data  = [trace]
    fig   = go.Figure(data = data,layout=layout)
    py.iplot(fig)
    
    return (predictions.head().style.set_caption("predictions"))
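
Inside `model`, feature names and importances are paired by merging two frames on their positional index. A possibly simpler pairing, sketched here on synthetic data (the `X`, `y` toy set and its column names are made up for illustration), is to index the importances by column name directly:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 3), columns=["f_signal", "f_noise1", "f_noise2"])
y = (X["f_signal"] > 0.5).astype(int)  # label depends on the first column only

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Index the importances by column name, then sort; no positional merge needed.
importance = pd.Series(clf.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

The resulting Series can be plotted as is, with its index as the feature axis. It only covers estimators that expose `feature_importances_`.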
    


In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
cols = train_X.columns
model(rfc,cols,"features")

In [None]:
from xgboost import XGBClassifier
xgbc = XGBClassifier()
model(xgbc,train_X.columns,"features")