# Context: 

This bank marketing dataset describes the bank's most recent marketing campaign. My goal is to find patterns in it that show how to improve the next marketing campaign and which techniques are likely to work best.

# I. Business Analysis

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt

In [None]:
df=pd.read_csv('../input/bank-marketing-dataset/bank.csv')
df.head()

In [None]:
df.info()
df.shape

In [None]:
df.describe()

The average age of clients is about 41 years old, the highest is 95 years old, and the lowest is 18 years old.

The average customer balance is 1528, but the standard deviation is very large, which indicates that balances are widely dispersed.
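A large standard deviation alone does not say whether the spread comes from a few extreme balances or from the bulk of the customers. The sketch below compares the mean and standard deviation with more robust measures (median, interquartile range, skewness). The helper name `dispersion_summary` and the toy values are hypothetical and are not taken from `bank.csv`.

```python
import pandas as pd


def dispersion_summary(values: pd.Series) -> dict:
    """Return robust and non-robust spread measures for a numeric column."""
    q1, q3 = values.quantile([0.25, 0.75])
    return {
        "mean": values.mean(),
        "median": values.median(),
        "std": values.std(),
        "iqr": q3 - q1,          # spread of the middle 50% of the values
        "skew": values.skew(),   # > 0 means a long right tail
    }


# Hypothetical toy balances with one extreme value (NOT from bank.csv)
toy_balance = pd.Series([0, 50, 120, 300, 450, 800, 20000])
summary = dispersion_summary(toy_balance)
print(summary)
```

On the real data the same call would be `dispersion_summary(df['balance'])`. A mean far above the median together with a positive skew would point to a few very large balances driving the standard deviation.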


# Data cleaning and filtering

In [None]:
df.isnull().sum()

There are no missing values in this dataset, so no imputation is needed.
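`isnull()` only detects real `NaN` values. Later cells filter out rows where `poutcome` equals `'unknown'`, so some gaps appear to be encoded as a placeholder string instead. A minimal sketch for counting such placeholders per column follows; the helper name `count_placeholders` and the toy frame are hypothetical.

```python
import pandas as pd


def count_placeholders(frame: pd.DataFrame, placeholder: str = "unknown") -> pd.Series:
    """Count the cells equal to `placeholder` in every column."""
    # Numeric columns never equal a string, so they simply report 0
    return frame.eq(placeholder).sum()


# Hypothetical toy data (NOT from bank.csv)
toy = pd.DataFrame({
    "job": ["student", "unknown", "retired"],
    "poutcome": ["unknown", "unknown", "success"],
    "age": [19, 45, 67],
})
placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

On the real data, `count_placeholders(df)` would show which columns carry `'unknown'` values and how many.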

# Let's first examine how a customer's balance or bank deposit relates to their employment, education, marital status, housing and loan status, and age

# 1. Job

In [None]:
plt.figure(figsize=(20,12))
plt.subplot(211)
sns.countplot(x = 'job', data = df, order = df['job'].value_counts().index)
plt.subplot(212)
sns.boxplot(x='job',y='age',data=df)

# Insight 1
- Management is the most common job type;
- Retired people are older than other jobs and students are the youngest;
- **Retired people have the highest balances.**
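The two plots above show job counts and age by job, so the balance claim is easier to check with a direct aggregation. The sketch below ranks groups by median balance; the helper name `median_balance_by` and the toy frame are hypothetical, and the median is my choice here because it is less sensitive to extreme balances than the mean.

```python
import pandas as pd


def median_balance_by(frame: pd.DataFrame, key: str) -> pd.Series:
    """Median balance per group, sorted from highest to lowest."""
    return frame.groupby(key)["balance"].median().sort_values(ascending=False)


# Hypothetical toy data (NOT from bank.csv)
toy = pd.DataFrame({
    "job": ["retired", "retired", "student", "student", "management"],
    "balance": [1000, 3000, 100, 300, 900],
})
ranking = median_balance_by(toy, "job")
print(ranking)
```

On the real data this would be `median_balance_by(df, 'job')`.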


In [None]:
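plt.figure(figsize=(20,12))
df['percent']=1  # helper column of ones, used only for counting rows
# Share of each deposit outcome within each job
data3=(df.groupby(['job','deposit']).percent.count()/df.groupby(['job']).percent.count()).to_frame().reset_index()
# Share of each known previous-campaign outcome within each job (the denominator still includes 'unknown')
data4=(df[df.poutcome!='unknown'].groupby(['job','poutcome']).percent.count()/df.groupby(['job']).percent.count()).to_frame().reset_index()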

plt.subplot(211)
sns.barplot(x='job',y='percent',data=data3,hue='deposit')
plt.subplot(212)
sns.barplot(x='job',y='percent',data=data4,hue='poutcome')


# Insight 2 
- **Students and retirees are more likely to have deposits.**
- The bank's services are harder to sell to blue-collar workers, entrepreneurs, service workers, and technicians.
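The per-job shares can also be computed without a helper column. The sketch below uses `pd.crosstab` with `normalize='index'`, which divides each row by its own total, so every job's shares sum to 1. The toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data (NOT from bank.csv)
toy = pd.DataFrame({
    "job": ["student", "student", "student",
            "blue-collar", "blue-collar",
            "retired", "retired", "retired", "retired"],
    "deposit": ["yes", "yes", "no",
                "no", "no",
                "yes", "yes", "yes", "no"],
})

# Rows are jobs, columns are deposit outcomes, values are within-job shares
rates = pd.crosstab(toy["job"], toy["deposit"], normalize="index")
print(rates)
```

On the real data, `pd.crosstab(df['job'], df['deposit'], normalize='index')` would give the same kind of table as `data3`, in wide form.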

# 2. Education 

In [None]:
plt.figure(figsize=(8,16))
group = df.groupby('education')
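# Median balance per education level, highest first (the string 'median' avoids the deprecation warning for np.median)
med_balance = group.aggregate({'balance':'median'}).sort_values('balance', ascending = False)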
print(med_balance)
sns.boxplot(x = 'education', y = 'balance', data = df,order = med_balance.index, color = 'steelblue')

# Insight
It can be seen that **the higher the education level, the more clients tend to have deposits**, and the greater their share of the bank's business.

# 3.  Marital

In [None]:
plt.figure(figsize=(15,8))
sns.countplot(x='marital', data = df)

In [None]:
plt.figure(figsize=(20,6))
plt.subplot(311)
plt.title('Balance Distribution by Marital Status',fontsize=20)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df[df.marital=='married'].balance, kde=True, stat='density')
plt.ylabel('married')
plt.subplot(312)
sns.histplot(df[df.marital=='divorced'].balance, kde=True, stat='density')
plt.ylabel('divorced')
plt.subplot(313)
sns.histplot(df[df.marital=='single'].balance, kde=True, stat='density')
plt.ylabel('single')

# Insight
Marital status shows no clear correlation with balance, and the three distributions differ little: most balances fall between 0 and 10000. However, **married people reach higher balance values**.


# 4.  Housing and Loan

In [None]:
plt.rcParams['figure.figsize']=(20,10)
plt.subplot(121)
sns.stripplot(x='housing',y='balance',data=df)
plt.subplot(122)
sns.stripplot(x='loan',y='balance',data=df)

# Insight: 
Having a housing loan or a personal loan strongly affects the balance, and **people without housing loans and personal loans tend to have higher balances.**
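The strip plots look at each loan type separately. To see how the two interact, a two-way table of median balance can help. The sketch below uses `pivot_table`; the toy frame is hypothetical, and the median is my choice of statistic rather than something the plots above compute.

```python
import pandas as pd

# Hypothetical toy data (NOT from bank.csv)
toy = pd.DataFrame({
    "housing": ["yes", "yes", "no", "no", "no", "yes"],
    "loan":    ["no", "yes", "no", "yes", "no", "no"],
    "balance": [500, 100, 2000, 400, 3000, 700],
})

# Rows: housing loan yes/no; columns: personal loan yes/no; cells: median balance
table = toy.pivot_table(index="housing", columns="loan",
                        values="balance", aggfunc="median")
print(table)
```

On the real data the same call on `df` would show whether customers with neither loan really have the highest typical balance.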


# 5. Customer's Age 

In [None]:
data=df  # alias, not a copy: new columns also appear on df
def agerank(age):
    """Map an age in years to a coarse age group."""
    if age<20:
        return 'teen'
    elif age<30:
        return 'young'
    elif age<40:
        return 'mid'
    elif age<60:
        return 'mid_old'
    return 'old'
data['age_status']=data['age'].apply(agerank)
# Share of each deposit outcome within each age group
data1=(data.groupby(['age_status','deposit']).age.count()/data.groupby(['age_status']).age.count()).to_frame().reset_index()
sns.barplot(x='age_status',y='age',data=data1,hue='deposit')
plt.ylabel('share of age group')  # values are fractions between 0 and 1
data1

# Insight 1
It can be seen that people aged 20 and over tend to have a deposit, while those under 20 are more likely to have none.
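The age grouping can also be expressed with `pd.cut`, which keeps the bin edges and labels in one place. This is a sketch of an alternative, not the method used above; the names `AGE_EDGES`, `AGE_LABELS` and `age_status` are hypothetical, while the labels and boundaries mirror `agerank`.

```python
import numpy as np
import pandas as pd

# Bin edges and labels chosen to mirror the agerank function
AGE_EDGES = [0, 20, 30, 40, 60, np.inf]
AGE_LABELS = ["teen", "young", "mid", "mid_old", "old"]


def age_status(ages: pd.Series) -> pd.Series:
    """Bin ages into left-closed intervals: [0, 20), [20, 30), ..., [60, inf)."""
    return pd.cut(ages, bins=AGE_EDGES, labels=AGE_LABELS, right=False)


# Hypothetical ages placed on and around each boundary
toy_ages = pd.Series([18, 20, 29, 30, 39, 40, 59, 60, 95])
groups = age_status(toy_ages).tolist()
print(groups)
```

Because `pd.cut` returns an ordered categorical, plots and group-bys keep the groups in age order instead of alphabetical order.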


In [None]:

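# Share of each known previous-campaign outcome within each age group (the denominator still includes 'unknown')
data2=(data[data.poutcome!='unknown'].groupby(['age_status','poutcome']).age.count()/data.groupby(['age_status']).age.count()).to_frame().reset_index()
sns.barplot(x='age_status',y='age',data=data2,hue='poutcome')
plt.ylabel('share of age group')  # values are fractions between 0 and 1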
data2

# Insight 2
The marketing success rate is higher for the elderly (over 60) and for the young (under 20); among the youngest clients there are no failures at all. The failure rate rises with age up to the 30-40 group and then falls again as age increases further.

**To sum up, the bank's key marketing targets should be those under 20 and over 60 years old. But the 20-to-60 group is the bank's main customer group, so their experience also needs to be considered.**


# II. Bank's Marketing Activity Analysis 

In [None]:
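import datetime
now=datetime.datetime.today()
bank_date=df  # alias, not a copy
# pdays counts the days since the client was last contacted in a previous campaign;
# a value of -1 (no previous contact) ends up dated one day after `now`
bank_date['campaign_date']=bank_date.pdays.apply(lambda x:now-datetime.timedelta(days=int(x)))
# Note: this replaces any existing 'month' column with the month derived from pdays
bank_date['month']=bank_date['campaign_date'].apply(lambda x:x.strftime('%m'))
month_counts=bank_date['month'].value_counts()
plt.bar(month_counts.index,month_counts)
plt.xlabel('month')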
plt.xlabel('month')

In [None]:
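# Count rows per month and outcome; after count() every column holds the row count, so 'age' serves as the count column
data=bank_date.groupby(['month','poutcome']).count().reset_index()
sns.barplot(x='month',y='age',data=data,hue='poutcome')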

# Insight
The two plots above show that **July has the most marketing activities.** However, many of the July activities have an `unknown` outcome, so the plot needs to be redrawn with `unknown` excluded.


In [None]:
sns.barplot(x='month',y='age',data=data[data['poutcome']!='unknown'],hue='poutcome')

Redrawing makes the relationship between month and the number of marketing activities, as well as the successes, clearer. It can be concluded that:
**Marketing activities are mainly concentrated in January, April and August.**
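The month derived from `pdays` depends on two things: the reference date (today's date changes every time the notebook is rerun) and how clients without a previous contact are handled. The sketch below assumes, as in the UCI documentation of this dataset, that `pdays == -1` marks clients who were never contacted before, and drops them before computing dates. The helper name, the new column names and the fixed reference date are hypothetical.

```python
import pandas as pd


def previous_contact_dates(frame: pd.DataFrame, reference_date: pd.Timestamp) -> pd.DataFrame:
    """Keep only previously contacted clients and date their last contact."""
    # Assumption: pdays == -1 means "never contacted", so those rows are dropped
    contacted = frame[frame["pdays"] >= 0].copy()
    contacted["prev_contact_date"] = reference_date - pd.to_timedelta(contacted["pdays"], unit="D")
    contacted["prev_contact_month"] = contacted["prev_contact_date"].dt.month
    return contacted


# Hypothetical toy data and reference date (NOT from bank.csv)
toy = pd.DataFrame({
    "pdays": [-1, 10, 40, -1],
    "poutcome": ["unknown", "success", "failure", "unknown"],
})
ref = pd.Timestamp("2023-07-15")
result = previous_contact_dates(toy, ref)
print(result)
```

Fixing the reference date makes the monthly counts reproducible, and dropping the `-1` rows keeps never-contacted clients from piling up in the reference month.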

> # Conclusion: 



1. **The highest deposit rate belongs to the student and elderly groups.** On the other hand, it is not easy to successfully market banking services to blue-collar workers, entrepreneurs, service workers, and technicians.
2. **Education level is related to having a deposit.** Higher education levels are often associated with higher deposit rates.
3. **Marital status is not strongly related to the account balance of customers.** However, married individuals tend to have higher account balances compared to single or divorced individuals.
4. **Having a mortgage or personal loan significantly affects the account balance**. People without mortgages and personal loans tend to have higher balances.
5. **Age of the customer also has a correlation with having a deposit.** People above the age of 20 tend to have deposits, while younger individuals under 20 may not have deposits.
6. **The bank's marketing activities are concentrated in January, April, and August.** July has the highest number of marketing activities, but many of them have an unknown outcome.



> # My recommendations for the bank's marketing activities:

1. **Focus on reaching and advising customers in the age group of 20 to 60,**  as they are the primary customer segment of the bank.

2. **Create specialized marketing campaigns targeting students and the elderly**, as they tend to have deposits and respond more successfully to campaigns.

3. **Research and develop attractive product and service packages** for blue-collar workers, entrepreneurs, service workers, and technicians.