In [None]:
Data Set Information:

The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).
The smaller datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).


Attribute Information:

Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
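The notes on `duration` (only known after the call) and on the 999 sentinel in `pdays` suggest some preparation before modelling. The following is a minimal sketch on a made-up three-row frame, not on the real file; the column `previously_contacted` is a hypothetical helper name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows, made up for illustration; they only mimic the column layout described above.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [79, 0, 310],
    "pdays": [999, 6, 999],
    "y": ["no", "no", "yes"],
})

# Drop `duration` (not known before the call) and separate the target.
features = toy.drop(columns=["duration", "y"])

# Keep the "never contacted" information as a flag (hypothetical column name),
# then turn the 999 sentinel into a proper missing value.
features["previously_contacted"] = features["pdays"] != 999
features["pdays"] = features["pdays"].replace(999, np.nan)

# Encode the binary target as 1/0.
target = toy["y"].map({"yes": 1, "no": 0})

print(features)
print(target.tolist())
```

Whether to keep `pdays` as a number with missing values or only the flag is a modelling choice; the sketch keeps both.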



In [None]:
1. How many campaigns are available in this dataset?
2. How many users do we have with both a housing loan and a personal loan?
3. How many people do we have aged 60+?
4. In which month did we target the most customers?
5. Which mode of call gives the most results?
6. How many entrepreneurs do we have in this list?
7. How many customers do we have with a negative balance?
8. Prepare groups of data based on education level.

In [42]:
import pandas as pd

In [43]:
pd.read_csv('d://DS/datasets/pandas/bank/bank.csv', sep=';')

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,9,may,153,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,6,feb,129,4,211,3,other,no


In [44]:
customers = pd.read_csv('d://DS/datasets/pandas/bank/bank.csv', sep=';')

In [6]:
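# How many campaigns are available in this dataset?
# Note: `campaign` holds the number of contacts per client, so this counts distinct contact counts.
customers['campaign'].nunique()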

32

In [9]:
len(set(customers['campaign'])) # Solution using set

32

In [25]:
# How many users do we have with housing and personal loan in this dataset?
len(customers[(customers['housing'] == "yes") & (customers['loan'] == "yes") ])

406

In [28]:
# How many people do we have aged 60+?
# Note: `> 60` excludes customers who are exactly 60; use `>= 60` to include them.
len(customers[customers['age'] > 60])

127

In [41]:
# In which month did we target the most customers?
customers.groupby('month').campaign.sum()

month
apr     547
aug    2489
dec      37
feb     500
jan     266
jul    2608
jun    1684
mar     131
may    3410
nov     758
oct     115
sep      85
Name: campaign, dtype: int64

In [63]:
customers['month'].value_counts()

may    1398
jul     706
aug     633
jun     531
nov     389
apr     293
feb     222
jan     148
oct      80
sep      52
mar      49
dec      20
Name: month, dtype: int64
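The two cells above measure different things: the `campaign` sum counts contacts made per month, while `value_counts()` counts customers per month. In the outputs above both point to May. A minimal sketch of picking the top month programmatically, using a five-row sample copied from the first rows of the table printed earlier rather than the full file:

```python
import pandas as pd

# Five rows (month and campaign only) copied from the head of the output above.
sample = pd.DataFrame({
    "month": ["oct", "may", "apr", "jun", "may"],
    "campaign": [1, 1, 1, 4, 1],
})

# Customers per month: one row per customer, so count rows.
customers_per_month = sample["month"].value_counts()
busiest_by_customers = customers_per_month.idxmax()

# Contacts per month: add up the per-customer contact counts.
contacts_per_month = sample.groupby("month")["campaign"].sum()
busiest_by_contacts = contacts_per_month.idxmax()

print(busiest_by_customers, busiest_by_contacts)
```

On this tiny sample the two answers differ (one customer in June was contacted four times), which is why it helps to decide up front whether "targeted" means customers or calls.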

In [67]:
# Which mode of call gives the most results?
cust_groupby = customers.groupby(['contact', 'y']).count()
cust_groupby.loc['cellular']

y,age,job,marital,education,default,balance,housing,loan,day,month,duration,campaign,pdays,previous,poutcome
no,2480,2480,2480,2480,2480,2480,2480,2480,2480,2480,2480,2480,2480,2480,2480
yes,416,416,416,416,416,416,416,416,416,416,416,416,416,416,416


In [68]:
cust_groupby.loc['cellular','yes']

age          416
job          416
marital      416
education    416
default      416
balance      416
housing      416
loan         416
day          416
month        416
duration     416
campaign     416
pdays        416
previous     416
poutcome     416
Name: (cellular, yes), dtype: int64
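Raw counts for one contact mode do not show which mode performs better, since the modes are used with different frequency. One way to compare them is the share of 'yes' outcomes per mode. The sketch below uses `pd.crosstab` on toy rows that are made up for illustration and are not taken from bank.csv.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "contact": ["cellular", "cellular", "cellular", "cellular",
                "telephone", "telephone", "unknown", "unknown"],
    "y": ["yes", "no", "no", "no", "yes", "no", "no", "no"],
})

# Absolute counts of each outcome per contact mode.
counts = pd.crosstab(toy["contact"], toy["y"])

# Row-normalised shares: each row sums to 1, so the 'yes' column is a success rate.
rates = pd.crosstab(toy["contact"], toy["y"], normalize="index")

best_mode = rates["yes"].idxmax()
print(counts)
print(rates)
print(best_mode)
```

On the real frame the same two calls would take `customers["contact"]` and `customers["y"]`.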

In [50]:
# How many entrepreneurs do we have in this list?
customers['job'].value_counts().entrepreneur

168

In [51]:
# how many customers do we have with negative balance 
len(customers[(customers['balance'] < 0)])

366

In [71]:
# prepare a group of data based on education level . 
customers.groupby('education').count()

education,age,job,marital,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
primary,678,678,678,678,678,678,678,678,678,678,678,678,678,678,678,678
secondary,2306,2306,2306,2306,2306,2306,2306,2306,2306,2306,2306,2306,2306,2306,2306,2306
tertiary,1350,1350,1350,1350,1350,1350,1350,1350,1350,1350,1350,1350,1350,1350,1350,1350
unknown,187,187,187,187,187,187,187,187,187,187,187,187,187,187,187,187


In [76]:
customers.groupby('education').primary

AttributeError: 'DataFrameGroupBy' object has no attribute 'primary'
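The error arises because attribute access on a groupby object looks up a column name, and `primary` is a value in the `education` column, not a column. Group values are retrieved with `get_group`. A minimal sketch on a five-row sample taken from the head of the table printed earlier:

```python
import pandas as pd

# Five rows (three columns) copied from the head of the output above.
sample = pd.DataFrame({
    "age": [30, 33, 35, 30, 59],
    "education": ["primary", "secondary", "tertiary", "tertiary", "secondary"],
    "balance": [1787, 4789, 1350, 1476, 0],
})

grouped = sample.groupby("education")

# .groups maps each education level to the row labels in that group.
levels = sorted(grouped.groups)

# get_group returns the sub-DataFrame for one level.
primary = grouped.get_group("primary")
print(levels)
print(primary)
```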

In [74]:
lst = []
for i in customers['education'].unique() :
    lst.append(customers[customers['education'] == i])

In [75]:
lst[1]

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
7,39,technician,married,secondary,no,147,yes,no,cellular,6,may,151,2,-1,0,unknown,no
10,39,services,married,secondary,no,9374,yes,no,unknown,20,may,273,1,-1,0,unknown,no
11,43,admin.,married,secondary,no,264,yes,no,cellular,17,apr,113,2,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4514,38,blue-collar,married,secondary,no,1205,yes,no,cellular,20,apr,45,4,153,1,failure,no
4515,32,services,single,secondary,no,473,yes,no,cellular,7,jul,624,5,-1,0,unknown,no
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown,no
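Indexing the list by position (`lst[1]`) depends on the order in which `unique()` happens to return the levels. A sketch of an alternative that keys each sub-frame by its education level, again on a small sample copied from the rows printed earlier; the name `by_education` is introduced here for illustration.

```python
import pandas as pd

# Five rows (three columns) copied from the head of the output above.
sample = pd.DataFrame({
    "age": [30, 33, 35, 30, 59],
    "education": ["primary", "secondary", "tertiary", "tertiary", "secondary"],
    "balance": [1787, 4789, 1350, 1476, 0],
})

# Iterating a groupby yields (group key, sub-DataFrame) pairs.
by_education = {level: frame for level, frame in sample.groupby("education")}

secondary = by_education["secondary"]
print(list(by_education))
print(secondary)
```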


In [None]:
iNeuron GitHub:
Read all resumes from the iNeuron GitHub in PDF and Word format and create a DataFrame with the resume name as the index
and, as columns, the email ID, LinkedIn ID, Git ID, and skills.


name | email | linkedin | git | skills

name/file name

Hint: use 3 regular expressions, one each for email, LinkedIn, and Git.
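A minimal sketch of the regex step, assuming the text of each resume has already been extracted from the PDF or Word file (that extraction is not shown here). The file names and resume texts are hypothetical, the patterns are deliberately simple rather than exhaustive, the Git pattern assumes a github.com profile URL, and the `Skills:` line format is an assumption about how the resumes are written.

```python
import re
import pandas as pd

# Simple illustrative patterns; real resumes may need looser or stricter ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LINKEDIN_RE = re.compile(r"(?:https?://)?(?:www\.)?linkedin\.com/in/[A-Za-z0-9_-]+", re.IGNORECASE)
GIT_RE = re.compile(r"(?:https?://)?(?:www\.)?github\.com/[A-Za-z0-9_-]+", re.IGNORECASE)
# Assumption: skills appear on a single line starting with "Skills:".
SKILLS_RE = re.compile(r"^skills\s*:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    match = pattern.search(text)
    return match.group(0) if match else None


def extract_fields(text):
    """Pull email, LinkedIn, Git, and skills out of one resume's text."""
    skills = SKILLS_RE.search(text)
    return {
        "email": first_match(EMAIL_RE, text),
        "linkedin": first_match(LINKEDIN_RE, text),
        "git": first_match(GIT_RE, text),
        "skills": skills.group(1).strip() if skills else None,
    }


# Hypothetical file names and already-extracted texts.
resumes = {
    "jane_doe.pdf": (
        "Jane Doe\nEmail: jane.doe@example.com\n"
        "LinkedIn: https://www.linkedin.com/in/jane-doe\n"
        "GitHub: https://github.com/janedoe\n"
        "Skills: Python, pandas, SQL"
    ),
    "john_roe.docx": "Email: john@example.org\nlinkedin.com/in/johnroe\nSkills: Excel",
}

rows = {name: extract_fields(text) for name, text in resumes.items()}
df = pd.DataFrame.from_dict(rows, orient="index")
df.index.name = "name"
print(df)
```

Fields that are not found stay empty, so incomplete resumes still get a row.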