# Logistic Regression Exercises

## Problem 1

The following variables were collected to build a model that predicts the onset of diabetes among Pima Indians. Analyze the data.

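| Variable | Description |
|--|--|
| Pregnancies | Number of pregnancies |
| Glucose | Glucose tolerance test value |
| BloodPressure | Blood pressure |
| SkinThickness | Triceps skinfold thickness |
| Insulin | Serum insulin |
| BMI | Body mass index |
| DiabetesPedigreeFunction | Diabetes pedigree function (family-history weighting) |
| Age | Age |
| Outcome | Diabetes status (0 or 1) |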

> https://data.hossam.kr/E05/indian_diabetes.xlsx

Assume that none of the independent variables is nominal and that all of them are normally distributed, so no dummy variables are needed.

In [10]:
from pandas import read_excel, DataFrame, merge
from matplotlib import pyplot as plt
import seaborn as sb
import numpy as np
from patsy import dmatrix
import sys
import os


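# Add the directory two levels above the working directory to the import path
# so that the local `helper` module can be found
sys.path.append(os.path.dirname(os.path.dirname(os.getcwd())))
from helper import my_logit, scalling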

In [2]:
plt.rcParams['font.family'] = "Malgun Gothic"
plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = (20,5)
plt.rcParams['axes.unicode_minus'] = False

In [3]:
df = read_excel("https://data.hossam.kr/E05/indian_diabetes.xlsx")
df.head()

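|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|--|--|--|--|--|--|--|--|--|--|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |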


In [4]:
# Standardize the independent variables, then reattach the untouched target
df_tmp = df.drop('Outcome', axis=1)
std_df = scalling(df_tmp)
std_df['Outcome'] = df['Outcome']
std_df.head()

|  | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|--|--|--|--|--|--|--|--|--|--|
| 0 | 0.639947 | 0.848324 | 0.149641 | 0.90727 | -0.692891 | 0.204013 | 0.468492 | 1.425995 | 1 |
| 1 | -0.844885 | -1.123396 | -0.160546 | 0.530902 | -0.692891 | -0.684422 | -0.365061 | -0.190672 | 0 |
| 2 | 1.23388 | 1.943724 | -0.263941 | -1.288212 | -0.692891 | -1.103255 | 0.604397 | -0.105584 | 1 |
| 3 | -0.844885 | -0.998208 | -0.160546 | 0.154533 | 0.123302 | -0.494043 | -0.920763 | -1.041549 | 0 |
| 4 | -1.141852 | 0.504055 | -1.504687 | 0.90727 | 0.765836 | 1.409746 | 5.484909 | -0.020496 | 1 |
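
The `scalling` function comes from the local `helper` module, and its implementation is not shown here. The standardized values above look like z-scores, so the sketch below assumes that is what it computes: subtract the column mean and divide by the population standard deviation. The `zscore` function is a hypothetical stand-in, and the five Glucose values are only the rows shown by `df.head()`, so the printed numbers will not match the table, which is based on all 768 rows.

```python
from statistics import fmean, pstdev

# Glucose values from the five rows shown by df.head() (toy input only)
glucose = [148, 85, 183, 89, 137]


def zscore(values):
    """Standardize values to mean 0 and population standard deviation 1.

    Hypothetical stand-in for `scalling`; the real helper may differ.
    """
    mu = fmean(values)
    sigma = pstdev(values)  # population SD (ddof=0), an assumption
    return [(v - mu) / sigma for v in values]


scaled = zscore(glucose)
print([round(z, 3) for z in scaled])
```

Standardizing puts every predictor on the same scale, which makes the fitted coefficients and odds ratios comparable across variables.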


In [5]:
# Logistic regression analysis
logit_result = my_logit(
    std_df,
    y='Outcome',
    x=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
       'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'],
)
print(logit_result.summary)

Optimization terminated successfully.
         Current function value: 0.470993
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                Outcome   No. Observations:                  768
Model:                          Logit   Df Residuals:                      759
Method:                           MLE   Df Model:                            8
Date:                Mon, 31 Jul 2023   Pseudo R-squ.:                  0.2718
Time:                        18:24:41   Log-Likelihood:                -361.72
converged:                       True   LL-Null:                       -496.74
Covariance Type:            nonrobust   LLR p-value:                 9.652e-54
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                   -0.8711      0.097     -8.986      0.000      -1.061

In [6]:
logit_result.cmdf

|  | Negative | Positive |
|--|--|--|
| True | 445 | 156 |
| False | 112 | 55 |


In [7]:
logit_result.odds_rate_df

|  | odds_rate |
|--|--|
| Intercept | 0.41849 |
| Pregnancies | 1.514071 |
| Glucose | 3.075735 |
| BloodPressure | 0.77323 |
| SkinThickness | 1.009916 |
| Insulin | 0.871755 |
| BMI | 2.027404 |
| DiabetesPedigreeFunction | 1.367468 |
| Age | 1.190947 |
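
In logistic regression an odds ratio is the exponential of the fitted coefficient. The sketch below checks that relationship using only numbers printed above: the Intercept coefficient from the summary and the Glucose entry from `odds_rate_df`. How the helper computes `odds_rate_df` internally is not shown, so treating it as `exp(coef)` is an assumption, although the Intercept value reproduces the reported 0.41849 up to rounding.

```python
import math

# Values copied from the outputs above
intercept_coef = -0.8711        # `coef` for Intercept in the summary table
glucose_odds_ratio = 3.075735   # `odds_rate` for Glucose

# An odds ratio is the exponential of the logit coefficient
intercept_odds_ratio = math.exp(intercept_coef)
print(round(intercept_odds_ratio, 4))

# Going the other way, the log of an odds ratio recovers the coefficient
glucose_coef = math.log(glucose_odds_ratio)
print(round(glucose_coef, 4))
```

Because the predictors were standardized, each odds ratio describes a one-standard-deviation increase: the odds of a positive outcome are multiplied by about 3.08 for Glucose, while values below 1 (BloodPressure, Insulin) indicate lower odds.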


In [8]:
logit_result.prs

0.27180966859224587
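
The value above agrees with the `Pseudo R-squ.` field of the summary, which statsmodels reports as McFadden's pseudo R²: one minus the ratio of the model log-likelihood to the null log-likelihood. The sketch below recomputes it from the rounded figures in the summary, so it matches only to about four decimal places. It also assumes that `Current function value` is the average negative log-likelihood per observation.

```python
# Values copied from the summary output above
ll_model = -361.72   # Log-Likelihood
ll_null = -496.74    # LL-Null
n_obs = 768          # No. Observations

# McFadden's pseudo R-squared
pseudo_r2 = 1 - ll_model / ll_null
print(round(pseudo_r2, 4))

# Average negative log-likelihood per observation
mean_nll = -ll_model / n_obs
print(round(mean_nll, 4))
```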

In [9]:
logit_result.result_df

|  | Explanatory power (pseudo R²) | Accuracy | Precision | Recall (TPR) | Fallout (FPR) | Specificity (TNR) | RAS | f1_score |
|--|--|--|--|--|--|--|--|--|
| 0 | 0.27181 | 0.782552 | 0.739336 | 0.58209 | 0.11 | 0.89 | 0.736045 | 0.651357 |
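
Each classification metric in `result_df` can be derived from the four counts in `logit_result.cmdf`. The sketch below does so by hand, reading the confusion matrix as True Negative = 445, True Positive = 156, False Negative = 112, and False Positive = 55. The RAS column matches (recall + specificity) / 2, which is what an ROC AUC computed from hard 0/1 predictions reduces to. That reading of RAS is an assumption, since the helper's code is not shown.

```python
# Counts read from logit_result.cmdf
tn, tp = 445, 156   # row "True":  Negative, Positive
fn, fp = 112, 55    # row "False": Negative, Positive

total = tn + tp + fn + fp

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted positives that are real
recall = tp / (tp + fn)               # true positive rate (TPR)
fallout = fp / (fp + tn)              # false positive rate (FPR)
specificity = tn / (tn + fp)          # true negative rate (TNR)
f1 = 2 * precision * recall / (precision + recall)
balanced = (recall + specificity) / 2  # assumed meaning of the RAS column

metrics = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "fallout": fallout,
    "specificity": specificity,
    "f1": f1,
    "balanced": balanced,
}
for name, value in metrics.items():
    print(f"{name:<12}{value:.6f}")
```

The model is far better at ruling out diabetes (specificity 0.89) than at detecting it (recall about 0.58), which matters if missed cases are the more costly error.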


## Problem 2

The following data is the Titanic passenger list. Perform appropriate preprocessing and cleaning, then analyze it.

| Variable | Description |
|---|---|
| PassengerId | Passenger ID (serves as an index) |
| Survived | Survival (0 = died, 1 = survived) |
| Pclass | Cabin class |
| Name | Name |
| Sex | Sex |
| SibSp | Number of siblings or spouses aboard |
| Parch | Number of parents or children aboard |
| Ticket | Ticket number |
| Fare | Fare |
| Cabin | Cabin |
| Embarked | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |

> https://data.hossam.kr/E05/titanic.xlsx

Assume that all independent variables are normally distributed.

In [11]:
df1 = read_excel("https://data.hossam.kr/E05/titanic.xlsx")
df1.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |
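
Unlike Problem 1, this dataset contains nominal variables (`Sex`, `Embarked`), identifier-like columns, and missing `Cabin` values, so some cleaning is needed before fitting. The sketch below shows one possible approach on the five rows shown above. The column choices, the median fill for `Age`, and the names `sample` and `clean` are illustrative assumptions, not the author's solution. `Name` is left out of the toy frame for brevity.

```python
import pandas as pd

# First five rows from df1.head(); Name is omitted for brevity
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Drop identifier-like columns and the sparsely filled Cabin column
# (a modelling choice made for this sketch, not a rule)
clean = sample.drop(columns=["PassengerId", "Ticket", "Cabin"])

# Fill missing ages with the median (a no-op for these five rows)
clean["Age"] = clean["Age"].fillna(clean["Age"].median())

# Dummy-encode nominal variables; drop the first level to avoid
# perfect collinearity with the intercept
clean = pd.get_dummies(
    clean, columns=["Sex", "Embarked"], drop_first=True, dtype=int
)
print(clean.columns.tolist())
```

On the full dataset `Embarked` has three levels, so `drop_first=True` would leave two dummy columns instead of the single one produced here.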
