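## **Regression Analysis**
    Model the relationship between X data (independent variables, features) and Y data (dependent variable, target), then predict a continuous value for newly given X data.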
    
--------------------------------------
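- There are three ways to tell whether a problem is a regression problem.
  1. Read the problem statement.
  
          The target is always described somewhere. Read the column name or its description to identify it.
          For example, if you are asked for a probability, it is classification. If the categories are coded as 0 or 1, it is still classification. Regression targets are quantities such as demand, usage, or sales volume.
  2. Inspect the target (label) values.

          Look at sample rows to see whether the target is a continuous number or a handful of repeating categories.
          If df['target'].value_counts() shows many distinct values, it is likely regression; if the values can be taken in at a glance, it is likely classification.

  3. Check the evaluation metric.
  
          The evaluation metric also tells the two apart.
          For example, metrics whose names end in "Error", such as MAE, MSE, and RMSE, indicate regression.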
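The second check can be sketched in code. This is only a rough heuristic under stated assumptions: the helper name `guess_task` and the `max_classes` threshold of 20 are hypothetical choices made for illustration, and the toy series are made up.

```python
import pandas as pd


def guess_task(target: pd.Series, max_classes: int = 20) -> str:
    """Rough heuristic: non-numeric or few distinct values -> classification."""
    n_unique = target.nunique()
    if not pd.api.types.is_numeric_dtype(target) or n_unique <= max_classes:
        return "classification"
    return "regression"


# Toy targets (made up): 50 distinct floats vs. a repeating 0/1 label
continuous_target = pd.Series([i * 1.37 for i in range(50)])
binary_target = pd.Series([0, 1] * 25)

continuous_guess = guess_task(continuous_target)
binary_guess = guess_task(binary_target)
print(continuous_guess, binary_guess)
```

A count-based rule like this can misfire (for example, an integer rating from 1 to 5 could be treated either way), so the problem statement and the metric remain the more reliable signals.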
          

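Q. Sales data was collected for roughly 1,500 products across 10 outlet stores.
Build a predictive model and predict the sales amount of each product at a given outlet store.

- The evaluation metric is RMSE
- The label (target) is the sales amount (Item_Outlet_Sales)
- Submit only the predictions, as a result.csv file (pred column)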
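For reference, RMSE is conventionally defined as below. The symbols are notation introduced here rather than taken from the problem statement: $n$ is the number of predictions, $y_i$ the actual sales amount, and $\hat{y}_i$ the predicted one.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is in the same unit as the target.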

In [1]:
from google.colab import files
uploaded = files.upload()

Saving train.csv to train.csv
Saving test.csv to test.csv


In [5]:
import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,NCR06,12.5,Low Fat,0.00676,Household,42.8112,OUT013,1987,High,Tier 3,Supermarket Type1,639.168
1,FDW11,12.6,Low Fat,0.048741,Breads,60.4194,OUT013,1987,High,Tier 3,Supermarket Type1,990.7104
2,FDH32,12.8,Low Fat,0.075997,Fruits and Vegetables,97.141,OUT013,1987,High,Tier 3,Supermarket Type1,2799.689
3,FDL52,6.635,Regular,0.046351,Frozen Foods,37.4506,OUT017,2007,,Tier 2,Supermarket Type1,1176.4686
4,FDO09,13.5,Regular,0.12517,Snack Foods,261.491,OUT013,1987,High,Tier 3,Supermarket Type1,3418.883


In [6]:
train.isnull().sum()
# Columns with missing values: Item_Weight (float) / Outlet_Size (object)

Item_Identifier,0
Item_Weight,1162
Item_Fat_Content,0
Item_Visibility,0
Item_Type,0
Item_MRP,0
Outlet_Identifier,0
Outlet_Establishment_Year,0
Outlet_Size,1940
Outlet_Location_Type,0


In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6818 entries, 0 to 6817
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            6818 non-null   object 
 1   Item_Weight                5656 non-null   float64
 2   Item_Fat_Content           6818 non-null   object 
 3   Item_Visibility            6818 non-null   float64
 4   Item_Type                  6818 non-null   object 
 5   Item_MRP                   6818 non-null   float64
 6   Outlet_Identifier          6818 non-null   object 
 7   Outlet_Establishment_Year  6818 non-null   int64  
 8   Outlet_Size                4878 non-null   object 
 9   Outlet_Location_Type       6818 non-null   object 
 10  Outlet_Type                6818 non-null   object 
 11  Item_Outlet_Sales          6818 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 639.3+ KB


In [11]:
test.isnull().sum()
# Columns with missing values: Item_Weight (float) / Outlet_Size (object)

Item_Identifier,0
Item_Weight,301
Item_Fat_Content,0
Item_Visibility,0
Item_Type,0
Item_MRP,0
Outlet_Identifier,0
Outlet_Establishment_Year,0
Outlet_Size,470
Outlet_Location_Type,0


In [12]:
train.describe(include='O')

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type
count,6818,6818,6818,6818,4878,6818,6818
unique,1554,5,16,10,3,3,4
top,FDW26,Low Fat,Snack Foods,OUT046,Medium,Tier 3,Supermarket Type1
freq,9,4092,963,763,2228,2664,4474


In [15]:
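# Encoding: collect the object-dtype (categorical) columns
# (named cat_cols so the built-in `list` is not shadowed)
cat_cols = train.columns[train.dtypes=='O']
cat_cols

Index(['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
       'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'],
      dtype='object')

In [16]:
# Separate the target so that train and test have the same columns
target = train.pop('Item_Outlet_Sales')
print(train.shape,test.shape)

(6818, 11) (1705, 11)


In [18]:
# Combine train and test so each category gets the same code in both
df=pd.concat([train,test])
df.shape

(8523, 11)

In [20]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Label-encode each categorical column (the encoder is refit for every column)
for col in cat_cols:
  df[col] = le.fit_transform(df[col])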

In [23]:
train = df.iloc[:len(train)]
test = df.iloc[len(train):]
train.shape,test.shape

((6818, 11), (1705, 11))
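One detail of the encoding step above is easy to miss: if a categorical column still contains missing values when it is label-encoded, those values end up encoded rather than imputed. A minimal sketch of one possible ordering, filling first and encoding afterwards, is shown below. The toy frames and their values are made up, and the names `toy_train`, `toy_test`, and `encoder` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for train/test (values are made up)
toy_train = pd.DataFrame({"Outlet_Size": ["High", np.nan, "Medium", "Medium"]})
toy_test = pd.DataFrame({"Outlet_Size": ["Small", np.nan]})

# Fill missing categories with the mode of the training data only
mode_value = toy_train["Outlet_Size"].mode()[0]
toy_train["Outlet_Size"] = toy_train["Outlet_Size"].fillna(mode_value)
toy_test["Outlet_Size"] = toy_test["Outlet_Size"].fillna(mode_value)

# Fit one encoder on train+test so both parts share the same mapping
combined = pd.concat([toy_train, toy_test], ignore_index=True)
encoder = LabelEncoder()
combined["Outlet_Size"] = encoder.fit_transform(combined["Outlet_Size"])

# Split back into the two parts by position
n_train = len(toy_train)
enc_train = combined.iloc[:n_train].copy()
enc_test = combined.iloc[n_train:].copy()
print(list(encoder.classes_))
```

Whether to impute before encoding or to keep "missing" as a category of its own is a modelling choice; tree-based models often cope well with either.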

In [25]:
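# Handle missing values; the test set is filled with statistics computed on train
# Item_Weight is float. Outlet_Size was object but has already been label-encoded above,
# so its missing values appear to have been given a code of their own, and the
# Outlet_Size fillna calls below likely have nothing left to fill.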
train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())
test['Item_Weight'] = test['Item_Weight'].fillna(train['Item_Weight'].mean())

train['Outlet_Size'] = train['Outlet_Size'].fillna(train['Outlet_Size'].mode()[0])
test['Outlet_Size'] = test['Outlet_Size'].fillna(train['Outlet_Size'].mode()[0])
train.isnull().sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())

Item_Identifier,0
Item_Weight,0
Item_Fat_Content,0
Item_Visibility,0
Item_Type,0
Item_MRP,0
Outlet_Identifier,0
Outlet_Establishment_Year,0
Outlet_Size,0
Outlet_Location_Type,0
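The "A value is trying to be set on a copy of a slice" warning printed above typically appears when a DataFrame obtained by slicing another one is modified afterwards. A minimal sketch of one common way to avoid it, taking an explicit `.copy()` at the split, is shown below on a made-up toy frame (`toy`, `part_a`, and `part_b` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Toy frame (made up) with missing weights
toy = pd.DataFrame({"Item_Weight": [12.5, np.nan, 6.0, np.nan]})

# .copy() makes each part an independent DataFrame rather than a slice of toy
part_a = toy.iloc[:2].copy()
part_b = toy.iloc[2:].copy()

# Fill both parts with the mean of the first part, mirroring "train statistics only"
mean_a = part_a["Item_Weight"].mean()
part_a["Item_Weight"] = part_a["Item_Weight"].fillna(mean_a)
part_b["Item_Weight"] = part_b["Item_Weight"].fillna(mean_a)

print(part_a["Item_Weight"].tolist(), part_b["Item_Weight"].tolist())
```

The original frame is left unchanged by these assignments, which is usually what is intended after a split.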


In [27]:
# Drop the item ID column (Item_Identifier).
print(train.shape,test.shape)
train.drop('Item_Identifier',axis=1,inplace=True)
test.drop('Item_Identifier',axis=1,inplace=True)
print(train.shape,test.shape)

(6818, 11) (1705, 11)
(6818, 10) (1705, 10)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train.drop('Item_Identifier',axis=1,inplace=True)


In [28]:
# Split off a validation set
from sklearn.model_selection import train_test_split
X_tr,X_val,y_tr,y_val = train_test_split(train,target,test_size=0.2,random_state=0)
X_tr.shape,X_val.shape,y_tr.shape,y_val.shape

((5454, 10), (1364, 10), (5454,), (1364,))

In [33]:
# Training - linear regression (LinearRegression has no random_state parameter)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_tr,y_tr)
pred = lr.predict(X_val)

from sklearn.metrics import mean_squared_error

# RMSE: square root of the mean squared error
def rmse(y_true,y_pred):
  return mean_squared_error(y_true,y_pred)**0.5

result = rmse(y_val,pred)
print("rmse: ",result)

rmse:  1138.4584904984333


In [39]:
# Training - random forest (available for both regression and classification; the regressor is used here)
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rf.fit(X_tr,y_tr)
pred = rf.predict(X_val)

result = rmse(y_val,pred)
print('rmse',result)

rmse 1046.9152778031398


In [41]:
# Training - LightGBM (regression model)
import lightgbm as lgb
model  = lgb.LGBMRegressor(random_state=0,verbose=1)
model.fit(X_tr,y_tr)
pred = model.predict(X_val)

result = rmse(y_val,pred)
print('rmse : ',result)

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001260 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 787
[LightGBM] [Info] Number of data points in the train set: 5454, number of used features: 10
[LightGBM] [Info] Start training from score 2202.546849
rmse :  1051.3379139047024
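The three scores above each come from a single 80/20 split, so small differences between models may partly reflect that particular split. A minimal sketch of comparing models with k-fold cross-validation is shown below. It runs on synthetic data from `make_regression` rather than the outlet data, and the model settings (for example `n_estimators=50`) are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the encoded training features and target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_rmse = {}
for name, estimator in models.items():
    # scikit-learn reports negated RMSE so that larger is better; flip the sign back
    scores = cross_val_score(
        estimator, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()

for name, value in cv_rmse.items():
    print(name, round(value, 2))
```

In the notebook, `X` and `y` would correspond to the preprocessed `train` frame and `target`, and the model with the lowest averaged RMSE would be the natural candidate for the test predictions.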


In [42]:
pred = model.predict(test)
pred

array([1262.38917313,  802.64043321, 1702.96570762, ..., 4043.62984266,
        843.27390946, 1345.05171333])

In [44]:
submit = pd.DataFrame({'pred':pred})
submit.to_csv("result.csv",index=False)

pd.read_csv('result.csv')

Unnamed: 0,pred
0,1262.389173
1,802.640433
2,1702.965708
3,1457.078186
4,2705.996738
...,...
1700,279.653313
1701,639.187489
1702,4043.629843
1703,843.273909
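Before submitting, it can help to check the file against the stated requirements: a single `pred` column, one row per test record, and no missing values. The sketch below assumes a hypothetical helper named `check_submission` and uses a made-up three-row file written to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def check_submission(path: str, expected_rows: int) -> bool:
    """True if the file has only a 'pred' column, the expected row count, and no NaN."""
    sub = pd.read_csv(path)
    return (
        list(sub.columns) == ["pred"]
        and len(sub) == expected_rows
        and not sub["pred"].isnull().any()
    )


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "result.csv")
    # index=False keeps the row index out of the file, as in the cell above
    pd.DataFrame({"pred": [1262.39, 802.64, 1702.97]}).to_csv(demo_path, index=False)
    ok = check_submission(demo_path, expected_rows=3)
    wrong_length = check_submission(demo_path, expected_rows=4)

print(ok, wrong_length)
```

For the notebook's own file, the call would be `check_submission("result.csv", expected_rows=len(test))`.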
