### 🧠 **Machine Learning Practice - Regression**

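#### 📊 **Machine Learning Workflow**
1. Define the problem (read and understand the task)
2. Import the required libraries and load the data
3. EDA (exploratory data analysis)
    - Inspect sample rows, data size, dtypes, summary statistics (numeric and categorical), missing values, etc.
4. Data preprocessing
    - Handle missing values and outliers, encode categorical features, scale features, etc.
5. Split off validation data
6. Train and evaluate models
7. Predict and create the result file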

---
#### **1. Problem Definition**

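**[Problem]**
- Data: product sales data from outlet stores
- Value to predict (target): the sales amount of each product
    - the `Item_Outlet_Sales` (sales amount) column
- Evaluation metric: RMSE
- Submission file: write only the predictions to 'result2.csv' (a single column named `pred`)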
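
Since the evaluation metric is RMSE, it may help to see how it is computed. The sketch below is a minimal illustration, not the grader's implementation; the helper name `rmse` and the sample numbers are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical sales amounts and predictions, for illustration only
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

# Errors are 10, -10 and 30, so the mean squared error is 1100 / 3
score = rmse(y_true, y_pred)
print(score)
```

Because the errors are squared before averaging, a single large miss (the 30 here) raises RMSE more than several small ones, and the result is in the same unit as the target.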

---


#### **2. Import the Required Libraries and Load the Data**

In [1]:
import pandas as pd

train = pd.read_csv('train2.csv')
test = pd.read_csv('test2.csv')

---
#### **3. EDA (Exploratory Data Analysis)**

In [2]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6818 entries, 0 to 6817
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            6818 non-null   object 
 1   Item_Weight                5656 non-null   float64
 2   Item_Fat_Content           6818 non-null   object 
 3   Item_Visibility            6818 non-null   float64
 4   Item_Type                  6818 non-null   object 
 5   Item_MRP                   6818 non-null   float64
 6   Outlet_Identifier          6818 non-null   object 
 7   Outlet_Establishment_Year  6818 non-null   int64  
 8   Outlet_Size                4878 non-null   object 
 9   Outlet_Location_Type       6818 non-null   object 
 10  Outlet_Type                6818 non-null   object 
 11  Item_Outlet_Sales          6818 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 639.3+ KB


In [3]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705 entries, 0 to 1704
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            1705 non-null   object 
 1   Item_Weight                1404 non-null   float64
 2   Item_Fat_Content           1705 non-null   object 
 3   Item_Visibility            1705 non-null   float64
 4   Item_Type                  1705 non-null   object 
 5   Item_MRP                   1705 non-null   float64
 6   Outlet_Identifier          1705 non-null   object 
 7   Outlet_Establishment_Year  1705 non-null   int64  
 8   Outlet_Size                1235 non-null   object 
 9   Outlet_Location_Type       1705 non-null   object 
 10  Outlet_Type                1705 non-null   object 
dtypes: float64(3), int64(1), object(7)
memory usage: 146.7+ KB


In [4]:
print(train.shape)
print(test.shape)

(6818, 12)
(1705, 11)


In [5]:
print(train.isnull().sum())
print(test.isnull().sum())

Item_Identifier                 0
Item_Weight                  1162
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  1940
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
Item_Identifier                0
Item_Weight                  301
Item_Fat_Content               0
Item_Visibility                0
Item_Type                      0
Item_MRP                       0
Outlet_Identifier              0
Outlet_Establishment_Year      0
Outlet_Size                  470
Outlet_Location_Type           0
Outlet_Type                    0
dtype: int64
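
Raw counts are easier to judge as a share of the rows. The sketch below reuses the counts and row totals printed above; the variable names are illustrative.

```python
import pandas as pd

# Missing-value counts and row totals copied from the outputs above
n_rows = {"train": 6818, "test": 1705}
missing = pd.DataFrame(
    {"train": [1162, 1940], "test": [301, 470]},
    index=["Item_Weight", "Outlet_Size"],
)

# Share of missing rows per column, as a percentage
# (the Series index is aligned with the DataFrame columns)
missing_pct = (missing / pd.Series(n_rows) * 100).round(1)
print(missing_pct)
```

On the actual DataFrames, `train.isnull().mean() * 100` should give the same percentages directly. Roughly 17% of `Item_Weight` and 28% of `Outlet_Size` are missing in both files, so dropping those rows would discard a large part of the data.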


In [6]:
train.describe()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|---|
| count | 5656.0 | 6818.0 | 6818.0 | 6818.0 | 6818.0 |
| mean | 12.872703 | 0.066121 | 140.419533 | 1997.88589 | 2190.941459 |
| std | 4.651034 | 0.051383 | 62.067861 | 8.339795 | 1706.131256 |
| min | 4.555 | 0.0 | 31.29 | 1985.0 | 33.29 |
| 25% | 8.785 | 0.026914 | 93.61005 | 1987.0 | 836.5777 |
| 50% | 12.6 | 0.053799 | 142.4483 | 1999.0 | 1806.6483 |
| 75% | 17.0 | 0.095273 | 185.06015 | 2004.0 | 3115.944 |
| max | 21.35 | 0.328391 | 266.8884 | 2009.0 | 13086.9648 |
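
`describe()` covers only numeric columns by default. The workflow also calls for statistics on categorical columns, and a minimal sketch of that step follows. The toy values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names match the dataset
toy = pd.DataFrame({
    "Item_MRP": [31.29, 142.45, 93.61, 185.06, 266.89],
    "Outlet_Size": ["Small", "Medium", None, "Medium", "High"],
    "Outlet_Type": ["Grocery", "Supermarket", "Supermarket", "Supermarket", "Grocery"],
})

# count / unique / top / freq for the non-numeric columns
cat_summary = toy.describe(exclude="number")
print(cat_summary)

# Frequency of each category, keeping missing values visible
size_counts = toy["Outlet_Size"].value_counts(dropna=False)
print(size_counts)
```

On the real data the same two calls would be applied to `train`. `value_counts(dropna=False)` is handy here because it shows missing values next to the real categories.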


---


#### **4. Data Preprocessing**
##### Comparing model performance before and after scaling

In [7]:
## Move the target into its own variable, then concatenate train and test
y_train = train.pop('Item_Outlet_Sales')
df = pd.concat([train, test], axis=0)

## Label encoding
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
cols = train.select_dtypes(include='object').columns

for col in cols :
    df[col] = le.fit_transform(df[col])

train = df.iloc[:len(train)].copy()
test = df.iloc[len(train):].copy()

# print(train.shape, test.shape)

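## Impute missing values
## Note: with recent scikit-learn versions, LabelEncoder above may already have encoded
## missing Outlet_Size values as a separate class, in which case the Outlet_Size
## fillna calls below have nothing left to fill.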
train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())
train['Outlet_Size'] = train['Outlet_Size'].fillna(train['Outlet_Size'].mode()[0])

test['Item_Weight'] = test['Item_Weight'].fillna(test['Item_Weight'].mean())
test['Outlet_Size'] = test['Outlet_Size'].fillna(test['Outlet_Size'].mode()[0])

# print(train.isnull().sum())
# print(test.isnull().sum())

## Drop the identifier column
train.drop('Item_Identifier', axis=1, inplace=True)
test.drop('Item_Identifier', axis=1, inplace=True)

## Scaling (optional)
# from sklearn.preprocessing import MinMaxScaler

# minmax = MinMaxScaler()
# cols = ['Item_Weight','Item_Visibility','Item_MRP','Outlet_Establishment_Year']

# train[cols] = minmax.fit_transform(train[cols])
# test[cols] = minmax.transform(test[cols])
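
To make the before-and-after comparison concrete, the sketch below fits a linear regression on raw and on min-max scaled features. It runs on synthetic data, not the outlet data, and the helper `fit_rmse` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on very different scales (hypothetical, not the outlet data)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(4, 22, 500),     # weight-like feature
    rng.uniform(0, 0.3, 500),    # visibility-like feature
    rng.uniform(30, 270, 500),   # price-like feature
])
y = 15 * X[:, 2] + 50 * X[:, 0] + rng.normal(0, 100, 500)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

def fit_rmse(X_fit, X_eval):
    """Fit a linear regression and return its validation RMSE."""
    model = LinearRegression().fit(X_fit, y_tr)
    return float(np.sqrt(mean_squared_error(y_va, model.predict(X_eval))))

rmse_raw = fit_rmse(X_tr, X_va)

# Fit the scaler on the training split only, then apply it to both splits
scaler = MinMaxScaler().fit(X_tr)
rmse_scaled = fit_rmse(scaler.transform(X_tr), scaler.transform(X_va))

print(rmse_raw, rmse_scaled)
```

The two RMSE values should agree up to floating-point error. Ordinary least squares with an intercept makes the same predictions after any per-feature rescaling, and tree-based models depend only on the ordering of feature values. That is why scaling can be skipped for the three models used here. Distance-based or regularized models are more sensitive to it.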


---
#### **5. Split Off Validation Data**


In [8]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(train, y_train, test_size=0.2, random_state=42)

X_train.shape, X_val.shape, y_train.shape, y_val.shape

((5454, 10), (1364, 10), (5454,), (1364,))
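
The split sizes follow directly from `test_size=0.2`. The sketch below reproduces them on placeholder arrays with the same shape as the training data and checks that a fixed `random_state` gives a repeatable split; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same row and feature counts as the training data
X = np.arange(6818 * 10).reshape(6818, 10)
y = np.arange(6818)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape, X_va.shape)

# Splitting again with the same random_state selects the same rows
_, _, _, y_va_again = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = bool(np.array_equal(y_va, y_va_again))
print(same_split)
```

The validation part is rounded up (0.2 × 6818 = 1363.6, so 1364 rows) and the remaining 5454 rows go to training, which matches the shapes shown above.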

---
#### **6. Model Training and Evaluation**
- Linear regression
- Random forest
- LightGBM


In [9]:
## Linear regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    root_mean_squared_error,  # requires scikit-learn 1.4 or later
)

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_val)

mse = mean_squared_error(y_val, y_pred)
print('MSE:', mse)

mae = mean_absolute_error(y_val, y_pred)
print('MAE:', mae)

r2 = r2_score(y_val, y_pred)
print('R2-score:', r2)

rmse = root_mean_squared_error(y_val, y_pred)
print('RMSE:', rmse)


MSE: 1441812.6369310766
MAE: 892.5632580973821
R2-score: 0.5172158035423333
RMSE: 1200.755027860003


In [10]:
## Random forest (no random_state is set, so the scores vary slightly between runs)
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_val)

mse = mean_squared_error(y_val, y_pred)
print('MSE:', mse)

mae = mean_absolute_error(y_val, y_pred)
print('MAE:', mae)

r2 = r2_score(y_val, y_pred)
print('R2-score:', r2)

rmse = root_mean_squared_error(y_val, y_pred)
print('RMSE:', rmse)

MSE: 1171518.6540172105
MAE: 759.8072649076247
R2-score: 0.607722475495335
RMSE: 1082.3671530572287


In [11]:
## LightGBM (gradient-boosted trees)
import lightgbm as lgb

model = lgb.LGBMRegressor(random_state=0, verbose=-1)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

mse = mean_squared_error(y_val, y_pred)
print('MSE:', mse)

mae = mean_absolute_error(y_val, y_pred)
print('MAE:', mae)

r2 = r2_score(y_val, y_pred)
print('R2-score:', r2)

rmse = root_mean_squared_error(y_val, y_pred)
print('RMSE:', rmse)

MSE: 1161959.080485737
MAE: 745.9307379894622
R2-score: 0.6109234538386057
RMSE: 1077.9420580373219


---
#### **7. Predict and Create the Result File**

In [12]:
# `model` is the LightGBM regressor from In [11], which had the lowest validation RMSE
pred = model.predict(test)
pred

array([1358.60132337,  769.5869624 , 1908.01527082, ..., 4123.71144336,
        834.65130901, 1320.83502939])

In [13]:
result = pd.DataFrame({'pred':pred})
result.to_csv('result2.csv', index=False)

In [14]:
result.shape

(1705, 1)
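
Before submitting, it can be worth reading the file back and checking it against the submission rules (a single `pred` column, one row per test row). The sketch below does this with a temporary directory and hypothetical predictions standing in for `model.predict(test)`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predictions standing in for model.predict(test)
pred = np.array([1358.6, 769.59, 1908.02])
n_test_rows = len(pred)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result2.csv")
    pd.DataFrame({"pred": pred}).to_csv(path, index=False)

    # Read the file back exactly as a grader would see it
    check = pd.read_csv(path)

ok_columns = list(check.columns) == ["pred"]   # only the prediction column
ok_rows = len(check) == n_test_rows            # one prediction per test row
ok_values = bool(check["pred"].notna().all())  # no missing predictions
print(ok_columns, ok_rows, ok_values)
```

With the real data, `n_test_rows` would be `len(test)`, which is 1705 here. Passing `index=False` is what keeps the row index from appearing as an extra unnamed column in the file.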