## Multiple Regression Exercises
-------
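  When solving a regression problem, the biggest differences from a classification problem are the model and the evaluation metric.
  For example, where classification would use RandomForestClassifier, regression uses RandomForestRegressor.
  Occasionally a classification model is used by mistake, and the predictions perform so poorly that the submission scores zero.

  If the metric a problem asks for is not included in scikit-learn, you may have to implement it yourself. In that case, it is best to work the problem with a regression metric you already know.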
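
As a rough illustration of implementing a metric by hand, the sketch below writes RMSE and MAPE in plain Python. The function names `rmse` and `mape` and the sample values are assumptions made for this example; they are not part of the exercises.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes no true value is 0."""
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)

# Made-up values, only to exercise the two functions
y_true = [100, 200, 400]
y_pred = [110, 180, 400]
print("RMSE:", rmse(y_true, y_pred))
print("MAPE:", mape(y_true, y_pred))
```

Both functions accept any two equal-length sequences of numbers, so a pandas Series of validation targets and an array of predictions can be passed directly.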

### Flight Ticket Price Prediction
- Predict the price of a flight ticket.
  - Provided data: flight_train.csv, flight_test.csv
  - Target column: price

- Evaluation metric: RMSE

In [None]:
from google.colab import files
uploaded = files.upload()

Saving flight_train.csv to flight_train.csv
Saving flight_test.csv to flight_test.csv


In [None]:
import pandas as pd
train = pd.read_csv("flight_train.csv")
test = pd.read_csv("flight_test.csv")
train.shape, test.shape

((10505, 11), (4502, 10))

In [None]:
target = train.pop('price')

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10505 entries, 0 to 10504
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   airline           10505 non-null  object 
 1   flight            10505 non-null  object 
 2   source_city       10505 non-null  object 
 3   departure_time    10505 non-null  object 
 4   stops             10505 non-null  object 
 5   arrival_time      10505 non-null  object 
 6   destination_city  10505 non-null  object 
 7   class             10505 non-null  object 
 8   duration          10505 non-null  float64
 9   days_left         10505 non-null  int64  
dtypes: float64(1), int64(1), object(8)
memory usage: 820.8+ KB


In [None]:
train.isnull().sum()

airline             0
flight              0
source_city         0
departure_time      0
stops               0
arrival_time        0
destination_city    0
class               0
duration            0
days_left           0


In [None]:
# Compare the categories in train and test
cols = train.select_dtypes(include='object').columns
for col in cols:
  set_train = set(train[col])
  set_test = set(test[col])
  same = set_train == set_test
  if same:
    print(col,"\n same categories")
  else:
    print(col,"\n different categories")

airline 
 same categories
flight 
 different categories
source_city 
 same categories
departure_time 
 same categories
stops 
 same categories
arrival_time 
 same categories
destination_city 
 same categories
class 
 same categories
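
The check above only says whether the category sets match. A minimal sketch of a follow-up step, listing which test categories never appear in train, could look like this. The helper `unseen_categories` and the toy frames are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy frames standing in for train/test; the rows are made up for illustration
toy_train = pd.DataFrame({"airline": ["Vistara", "Indigo"],
                          "flight": ["UK-776", "6E-752"]})
toy_test = pd.DataFrame({"airline": ["Indigo", "Vistara"],
                         "flight": ["UK-776", "AI-763"]})

def unseen_categories(train_df, test_df, cols):
    """Map each column to the sorted test categories that never occur in train."""
    report = {}
    for col in cols:
        only_in_test = set(test_df[col]) - set(train_df[col])
        if only_in_test:
            report[col] = sorted(only_in_test)
    return report

print(unseen_categories(toy_train, toy_test, ["airline", "flight"]))
```

Columns that show up in the report are the ones where encoding train and test separately would produce mismatched features.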


In [None]:
train.describe()

| | duration | days_left |
|---|---|---|
| count | 10505.0 | 10505.0 |
| mean | 12.225536 | 26.050547 |
| std | 7.182264 | 13.539947 |
| min | 0.83 | 1.0 |
| 25% | 6.75 | 15.0 |
| 50% | 11.25 | 26.0 |
| 75% | 16.17 | 38.0 |
| max | 40.5 | 49.0 |


In [None]:
train.shape,test.shape

((10505, 10), (4502, 10))

In [None]:
train = train.drop("flight",axis=1)
test = test.drop("flight",axis=1)

In [None]:
# One-hot encode train and test separately; this is safe here only because,
# once flight is dropped, every categorical column has the same categories in both sets
train = pd.get_dummies(train)
test = pd.get_dummies(test)

from sklearn.model_selection import train_test_split
X_tr,X_val,y_tr,y_val = train_test_split(train,target,test_size=0.2,random_state=0)

print("Split data shapes")
print(X_tr.shape,X_val.shape,y_tr.shape,y_val.shape)

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rf.fit(X_tr,y_tr)
pred = rf.predict(X_val)

from sklearn.metrics import mean_squared_error
# RMSE is the square root of the MSE (the squared=False argument is deprecated since scikit-learn 1.4)
result = mean_squared_error(y_val,pred) ** 0.5
print("RMSE : ",result)

pred = rf.predict(test)
submit = pd.DataFrame({'pred':pred})
submit.to_csv("result.csv",index=False)

print(pd.read_csv('result.csv').head(3))

Split data shapes
(8404, 37) (2101, 37) (8404,) (2101,)
RMSE :  4376.841613585934
       pred
0  57356.34
1   5334.44
2  13244.83




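### Improving Performance
Keep the flight column, but only part of it: the leading airline code repeats information already present elsewhere, so strip the letters before the hyphen and keep the number.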

In [19]:
import pandas as pd
train = pd.read_csv("flight_train.csv")
test = pd.read_csv("flight_test.csv")
train.shape, test.shape

In [None]:
target = train.pop('price')

In [None]:
train['flight']

0         UK-776
1         UK-852
2        6E-2348
3         AI-763
4         6E-752
          ...
10500     UK-864
10501     UK-774
10502    I5-1531
10503     UK-651


In [None]:
train['f2'] = train['flight'].str.split("-").str[1].astype(int)
test['f2'] = test['flight'].str.split("-").str[1].astype(int)

In [None]:
train = train.drop('flight',axis=1)
test = test.drop('flight',axis=1)

In [None]:
# Scaling (tree-based models such as random forests are insensitive to feature scale, so this step is optional here)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
cols = ['duration','days_left']
train[cols] = scaler.fit_transform(train[cols])
test[cols] = scaler.transform(test[cols])

# One-hot encoding
train = pd.get_dummies(train)
test = pd.get_dummies(test)

from sklearn.model_selection import train_test_split
X_tr,X_val,y_tr,y_val = train_test_split(train,target,test_size=0.2,random_state=0)

print("Split data shapes")
print(X_tr.shape,X_val.shape,y_tr.shape,y_val.shape)

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(max_depth=20,n_estimators=200,random_state=0)
rf.fit(X_tr,y_tr)
pred = rf.predict(X_val)

from sklearn.metrics import mean_squared_error
# RMSE is the square root of the MSE (the squared=False argument is deprecated since scikit-learn 1.4)
result = mean_squared_error(y_val,pred) ** 0.5
print("RMSE : ",result)



Split data shapes
(8404, 38) (2101, 38) (8404,) (2101,)
RMSE :  3675.155093297134




### Laptop Price Prediction
- Predict the price of a laptop.
  - Provided data: laptop_train.csv, laptop_test.csv
  - Target column: Price

- Evaluation metric: R^2 (coefficient of determination)
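
For reference, R^2 compares the model's squared error with that of always predicting the mean of the targets: R^2 = 1 - SS_res / SS_tot. The sketch below computes it in plain Python; the function name `r2` and the sample values are assumptions for illustration only.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot

y_true = [1, 2, 3]
print(r2(y_true, [1, 2, 3]))  # perfect predictions
print(r2(y_true, [2, 2, 2]))  # always predicting the mean
print(r2(y_true, [3, 2, 1]))  # worse than predicting the mean
```

A perfect model scores 1, a model no better than the mean scores 0, and a model worse than the mean goes negative, so higher is better (unlike RMSE).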

In [1]:
from google.colab import files
uploaded = files.upload()

Saving laptop_train.csv to laptop_train.csv
Saving laptop_test.csv to laptop_test.csv


In [20]:
import pandas as pd
train = pd.read_csv("laptop_train.csv")
test = pd.read_csv("laptop_test.csv")
train.shape, test.shape

((91, 10), (39, 9))

In [3]:
train.head(3)

| | Brand | Model | Series | Processor | Processor_Gen | RAM | Hard_Disk_Capacity | OS | Rating | Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASUS | VivoBook | 15.0 | i3 | 10th | 8.0 | 512 GB SSD | Windows 11 Home | 4.3 | 37940 |
| 1 | DELL | Inspiron | NaN | i3 | 11th | 8.0 | 1 TB HDD | Windows 11 Home | 3.7 | 39040 |
| 2 | ASUS | VivoBook | 15.0 | i7 | 10th | 16.0 | 512 GB SSD | Windows 11 Home | 4.1 | 57940 |


In [12]:
cols = train.select_dtypes(include='object').columns
for col in cols:
  set_train = set(train[col])
  set_test = set(test[col])
  same = set_train == set_test
  if same:
    print(col,'\t same')
  else:
    print(col,'\t different')

Brand 	 different
Model 	 different
Series 	 different
Processor 	 different
Processor_Gen 	 different
Hard_Disk_Capacity 	 different
OS 	 different
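
When the categories differ like this, one-hot encoding train and test separately yields frames with different columns, which a fitted model cannot consume. The sketch below illustrates the problem and the concatenate-then-encode workaround on two made-up frames; the toy data and variable names are assumptions for this example.

```python
import pandas as pd

# Made-up rows; DELL appears only in train and HP only in test
toy_train = pd.DataFrame({"Brand": ["ASUS", "DELL"], "RAM": [8.0, 16.0]})
toy_test = pd.DataFrame({"Brand": ["HP", "ASUS"], "RAM": [8.0, 8.0]})

# Encoding separately: the dummy columns do not line up
separate_train = pd.get_dummies(toy_train, columns=["Brand"])
separate_test = pd.get_dummies(toy_test, columns=["Brand"])
print(list(separate_train.columns))
print(list(separate_test.columns))

# Encoding the concatenated frame, then splitting it back by row count
combined = pd.get_dummies(pd.concat([toy_train, toy_test]), columns=["Brand"])
n_train = len(toy_train)
joint_train = combined[:n_train]
joint_test = combined[n_train:]
print(list(joint_train.columns) == list(joint_test.columns))
```

Splitting by row position works because `pd.concat` keeps the rows of the first frame ahead of the rows of the second.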


In [13]:
train.isnull().sum()

Brand                  0
Model                  9
Series                36
Processor              5
Processor_Gen          5
RAM                    6
Hard_Disk_Capacity     6
OS                     6
Rating                 0
Price                  0


In [14]:
# Columns with missing values: Model / Series / Processor / Processor_Gen / Hard_Disk_Capacity / OS / RAM
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91 entries, 0 to 90
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Brand               91 non-null     object 
 1   Model               82 non-null     object 
 2   Series              55 non-null     object 
 3   Processor           86 non-null     object 
 4   Processor_Gen       86 non-null     object 
 5   RAM                 85 non-null     float64
 6   Hard_Disk_Capacity  85 non-null     object 
 7   OS                  85 non-null     object 
 8   Rating              91 non-null     float64
 9   Price               91 non-null     int64  
dtypes: float64(2), int64(1), object(7)
memory usage: 7.2+ KB


In [15]:
target = train.pop('Price')

c_cols = ['Model','Series','Processor','Processor_Gen','Hard_Disk_Capacity','OS']
train[c_cols] = train[c_cols].fillna("X")
test[c_cols] = test[c_cols].fillna("X")

train['RAM']=train['RAM'].fillna(-1)
test['RAM']=test['RAM'].fillna(-1)

In [17]:
combined = pd.concat([train,test])
combined_dummies = pd.get_dummies(combined)
n_train = len(train)
train = combined_dummies[:n_train]
test = combined_dummies[n_train:]

from sklearn.model_selection import train_test_split
X_tr,X_val,y_tr,y_val = train_test_split(train,target,test_size=0.2,random_state=0)

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rf.fit(X_tr,y_tr)
pred = rf.predict(X_val)

from sklearn.metrics import r2_score
result = r2_score(y_val,pred)
print('\n r2:',result)

pred = rf.predict(test)
submit = pd.DataFrame({'pred':pred})
submit.to_csv("result.csv",index=False)


 r2: 0.7496764602229047


In [22]:
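# Reload the raw data first: train and test were replaced by their one-hot encoded versions above
train = pd.read_csv("laptop_train.csv")
test = pd.read_csv("laptop_test.csv")
train.head(10)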

| | Brand | Model | Series | Processor | Processor_Gen | RAM | Hard_Disk_Capacity | OS | Rating | Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ASUS | VivoBook | 15 | i3 | 10th | 8.0 | 512 GB SSD | Windows 11 Home | 4.3 | 37940 |
| 1 | DELL | Inspiron | NaN | i3 | 11th | 8.0 | 1 TB HDD | Windows 11 Home | 3.7 | 39040 |
| 2 | ASUS | VivoBook | 15 | i7 | 10th | 16.0 | 512 GB SSD | Windows 11 Home | 4.1 | 57940 |
| 3 | DELL | NaN | NaN | i3 | 10th | 8.0 | 1 TB HDD | Windows 10 | 3.2 | 41340 |
| 4 | Lenovo | IdeaPad | Slim | i3 | 11th | 8.0 | 512 GB SSD | Windows 10 Home | 4.4 | 45440 |
| 5 | ASUS | TUF | Gaming | i5 | 11th | 16.0 | 512 GB SSD | Windows 10 Home | 4.6 | 89940 |
| 6 | ASUS | VivoBook | Ultra | i3 | 11th | 8.0 | 512 GB SSD | Windows 11 Home | 4.8 | 42940 |
| 7 | HP | NaN | NaN | i3 | 10th | 8.0 | 512 GB SSD | Windows 10 Home | 4.3 | 42340 |
| 8 | APPLE | 2020 | Macbook | NaN | NaN | NaN | NaN | NaN | 4.6 | 129990 |
| 9 | DELL | Inspiron | NaN | i3 | 11th | 8.0 | 256 GB SSD | Windows 11 Home | 4.3 | 41540 |


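##### Improving Performance
- Drop the Series column: about 40% of its values are missing (36 of 91), so it is removed rather than imputed.
- Drop the Model column: the Brand column already carries part of the Model information.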

In [23]:
target = train.pop('Price')

train = train.drop('Series',axis=1)
test = test.drop('Series',axis=1)

train = train.drop('Model',axis=1)
test = test.drop('Model',axis=1)

c_cols = ['Brand','Processor','Processor_Gen','Hard_Disk_Capacity','OS','Rating']
train[c_cols] = train[c_cols].fillna('X')
test[c_cols] = test[c_cols].fillna('X')

train['RAM'] = train['RAM'].fillna(-1)
test['RAM'] = test['RAM'].fillna(-1)

combined = pd.concat([train,test])
combined_dummies = pd.get_dummies(combined)
n_train = len(train)
train = combined_dummies[:n_train]
test = combined_dummies[n_train:]

from sklearn.model_selection import train_test_split
X_tr,X_val,y_tr,y_val = train_test_split(train,target,test_size=0.2,random_state=0)

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rf.fit(X_tr,y_tr)
pred = rf.predict(X_val)

from sklearn.metrics import r2_score
result = r2_score(y_val,pred)
print("result:",result)

pred = rf.predict(test)
submit = pd.DataFrame({'pred':pred})
submit.to_csv("result.csv",index=False)

result: 0.8042392429064131


### Used Car Price Prediction
- Predict the price of a used car.
  - Provided data: car_train.csv, car_test.csv
  - Target column: Price

- Evaluation metric: RMSLE
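
RMSLE is the RMSE computed on log(1 + y) rather than on y, which matches taking the square root of scikit-learn's `mean_squared_log_error`. A minimal plain-Python sketch follows; the function name `rmsle` and the sample values are assumptions for illustration.

```python
import math

def rmsle(y_true, y_pred):
    """sqrt(mean((log(1 + y_hat) - log(1 + y))^2)); inputs must be non-negative."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Missing by the same ratio costs about the same at very different price levels
print(rmsle([1000], [2000]))
print(rmsle([100000], [200000]))
```

Because the error is measured on a log scale, the metric penalizes relative rather than absolute differences, which suits a target like Price that ranges from 3 to 228935 in the training data.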

In [24]:
from google.colab import files
uploaded = files.upload()

Saving car_test.csv to car_test.csv
Saving car_train.csv to car_train.csv


In [52]:
train = pd.read_csv('car_train.csv')
test = pd.read_csv('car_test.csv')
train.shape,test.shape

((6732, 17), (5772, 16))

In [28]:
train.isnull().sum()

Price               0
Levy                0
Manufacturer        0
Model               0
Prod. year          0
Category            0
Leather interior    0
Fuel type           0
Engine volume       0
Mileage             0


In [30]:
test.isnull().sum()

Levy                0
Manufacturer        0
Model               0
Prod. year          0
Category            0
Leather interior    0
Fuel type           0
Engine volume       0
Mileage             0
Cylinders           0


In [32]:
train.head(10)

| | Price | Levy | Manufacturer | Model | Prod. year | Category | Leather interior | Fuel type | Engine volume | Mileage | Cylinders | Gear box type | Drive wheels | Doors | Wheel | Color | Airbags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13956 | 603 | LEXUS | RX 450 | 2015 | Jeep | Yes | Hybrid | 3.5 | 143619 km | 6.0 | Automatic | 4x4 | 04-May | Left wheel | Black | 12 |
| 1 | 26108 | 640 | SSANGYONG | REXTON | 2013 | Jeep | Yes | Diesel | 2 | 111307 km | 4.0 | Automatic | Front | 04-May | Left wheel | White | 4 |
| 2 | 549 | 1493 | MERCEDES-BENZ | GLE 350 | 2016 | Jeep | Yes | Petrol | 3.5 | 91493 km | 6.0 | Automatic | Rear | 04-May | Left wheel | Black | 0 |
| 3 | 14113 | 475 | FIAT | 500 | 2012 | Sedan | Yes | Petrol | 1.4 | 88000 km | 4.0 | Tiptronic | Front | 02-Mar | Left wheel | Black | 6 |
| 4 | 21739 | 639 | CHEVROLET | Orlando | 2014 | Jeep | Yes | Diesel | 2 | 177103 km | 4.0 | Automatic | Front | 04-May | Left wheel | White | 4 |
| 5 | 19444 | - | FORD | Transit | 2008 | Microbus | No | Diesel | 2.4 Turbo | 214000 km | 4.0 | Manual | Rear | 04-May | Left wheel | Orange | 2 |
| 6 | 19200 | - | JAGUAR | XJ | 1999 | Sedan | Yes | Petrol | 4 | 1000 km | 8.0 | Tiptronic | 4x4 | 04-May | Left wheel | Green | 7 |
| 7 | 51746 | 891 | BMW | 330 | 2016 | Sedan | Yes | Petrol | 2.0 Turbo | 51000 km | 4.0 | Tiptronic | Rear | 04-May | Left wheel | Blue | 10 |
| 8 | 1411 | 640 | SUZUKI | SX4 | 2013 | Sedan | Yes | Petrol | 2 | 193504 km | 4.0 | Automatic | Front | 04-May | Left wheel | Red | 0 |
| 9 | 11676 | 761 | CHEVROLET | Lacetti | 2010 | Sedan | Yes | Petrol | 1.8 | 153966 km | 4.0 | Automatic | Front | 04-May | Left wheel | Silver | 4 |


In [33]:
train.describe()

| | Price | Prod. year | Cylinders | Airbags |
|---|---|---|---|---|
| count | 6732.0 | 6732.0 | 6732.0 | 6732.0 |
| mean | 17018.565954 | 2010.997772 | 4.575609 | 6.551693 |
| std | 17497.072247 | 5.538817 | 1.209242 | 4.364451 |
| min | 3.0 | 1953.0 | 1.0 | 0.0 |
| 25% | 5331.0 | 2009.0 | 4.0 | 4.0 |
| 50% | 13172.0 | 2012.0 | 4.0 | 5.0 |
| 75% | 21953.0 | 2015.0 | 4.0 | 12.0 |
| max | 228935.0 | 2020.0 | 16.0 | 16.0 |


In [35]:
cols = train.select_dtypes(include='O').columns
for col in cols:
  set_train = set(train[col])
  set_test = set(test[col])
  same = set_train == set_test
  if same:
    print(col,"\t same")
  else:
    print(col,"\t different")


Levy 	 different
Manufacturer 	 different
Model 	 different
Category 	 same
Leather interior 	 same
Fuel type 	 different
Engine volume 	 different
Mileage 	 different
Gear box type 	 same
Drive wheels 	 same
Doors 	 same
Wheel 	 same
Color 	 same


In [43]:
target= train.pop('Price')

In [45]:
from sklearn.preprocessing import LabelEncoder
combined = pd.concat([train,test])
cols = train.select_dtypes(include='O').columns

for col in cols:
  le = LabelEncoder()
  combined[col] = le.fit_transform(combined[col])

n_train = len(train)
train = combined[:n_train]
test = combined[n_train:]

from sklearn.model_selection import train_test_split
X_tr,X_val,y_tr,y_val = train_test_split(train,target,test_size=0.2,random_state=0)

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rf.fit(X_tr,y_tr)
pred = rf.predict(X_val)

from sklearn.metrics import mean_squared_log_error
result = mean_squared_log_error(y_val,pred) **0.5
print("RMSLE:",result)

pred = rf.predict(test)
submit = pd.DataFrame({'pred':pred})
submit.to_csv("result.csv",index=False)

RMSLE: 1.1008952910276844


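##### Improving Performance
- Engine volume: split the `Turbo` suffix off into its own 0/1 column and convert the rest to a float.
- Mileage: drop the `km` unit and convert to an integer.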
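
The two conversions can be sketched on single strings in plain Python, which makes the intended result easy to check before applying the pandas `.str` methods to whole columns. The helper names `parse_engine_volume` and `parse_mileage` are hypothetical; the sample strings are taken from the `train.head(10)` output above.

```python
def parse_engine_volume(raw):
    """Split '2.4 Turbo' into (2.4, 1) and '3.5' into (3.5, 0)."""
    has_turbo = int("Turbo" in raw)
    volume = float(raw.replace("Turbo", "").strip())
    return volume, has_turbo

def parse_mileage(raw):
    """Turn '143619 km' into the integer 143619."""
    return int(raw.split()[0])

print(parse_engine_volume("2.4 Turbo"))
print(parse_engine_volume("3.5"))
print(parse_mileage("143619 km"))
```

Once both columns are numeric they keep their natural ordering; label-encoding the raw strings instead assigns each distinct string an arbitrary integer.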

In [47]:
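# Reload the raw data first: train and test were label-encoded in the cell above
train = pd.read_csv('car_train.csv')
test = pd.read_csv('car_test.csv')
train['Engine volume'].value_counts()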

Engine volume
2            1342
2.5           823
1.8           623
1.6           533
1.5           453
             ... 
0.8 Turbo       1
3.1             1
4.6 Turbo       1
4.2 Turbo       1


In [49]:
train['Mileage'].value_counts()

Mileage
0 km         235
200000 km     62
150000 km     48
100000 km     46
120000 km     39
            ... 
216751 km      1
276000 km      1
44545 km       1
99162 km       1


In [53]:
target = train.pop('Price')
train['turbo'] = train['Engine volume'].str.contains('Turbo').astype(int)
train['Engine volume'] = train['Engine volume'].str.replace('Turbo','').astype(float)

test['turbo'] = test['Engine volume'].str.contains('Turbo').astype(int)
test['Engine volume'] = test['Engine volume'].str.replace('Turbo','').astype(float)

train['Mileage'] = train['Mileage'].str.split().str[0].astype(int)
test['Mileage'] = test['Mileage'].str.split().str[0].astype(int)

from sklearn.preprocessing import LabelEncoder
combined = pd.concat([train,test])
cols = train.select_dtypes(include='O').columns

for col in cols:
  le = LabelEncoder()
  combined[col] = le.fit_transform(combined[col])

n_train = len(train)
train = combined[:n_train]
test = combined[n_train:]

from sklearn.model_selection import train_test_split
X_tr,X_val,y_tr,y_val = train_test_split(train,target,test_size=0.2,random_state=0)

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)
rf.fit(X_tr,y_tr)
pred = rf.predict(X_val)

from sklearn.metrics import mean_squared_log_error
result = mean_squared_log_error(y_val,pred) ** 0.5
print("result: ",result)

pred = rf.predict(test)
submit = pd.DataFrame({'pred':pred})
submit.to_csv("result.csv",index=False)

result:  1.0823364430321651
