# Pandas Exercises for Data Analysis

*by Selva Prabhakaran*
From the website: https://www.machinelearningplus.com/python/101-pandas-exercises-python/

[Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

#### 1. (L1) Import pandas as `pd` and check the version.

In [1]:
import pandas as pd

pd.__version__

'1.5.3'

#### 2. (L1) Display a one-dimensional array as a pandas Series.

In [2]:
import numpy as np

mylist = list('abcdefghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)

ser = ser.to_frame().reset_index().iloc[:5]
# ser.to_frame().reset_index().iloc[:5].rename(columns={'index': 'letters', 0: 'numbers'})
ser

Unnamed: 0,index,0
0,a,0
1,b,1
2,c,2
3,d,3
4,e,4


Unnamed: 0,letters,numbers
0,a,0
1,b,1
2,c,2
3,d,3
4,e,4


#### 3. (L1) Combine two pandas Series into a DataFrame.

In [9]:
ser1 = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

### TODO ###
pd.???([ser1, ser2], axis=1).iloc[:5]

Unnamed: 0,0,1
0,a,0
1,b,1
2,c,2
3,d,3
4,e,4


In [11]:
### TODO ###
pd.??([ser1, ser2]).iloc[:7]

0    a
1    b
2    c
3    d
4    e
5    f
6    g
dtype: object
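
One possible way to fill in the two blanks above is `pd.concat`, which appends Series along the rows by default and places them side by side with `axis=1`. The sketch below assumes that this is the intended function; the names `side_by_side` and `stacked` are illustrative only.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

# axis=1 aligns the two Series on their index and turns each one into a column
side_by_side = pd.concat([ser1, ser2], axis=1)

# The default axis=0 appends ser2 below ser1 and keeps the original index labels
stacked = pd.concat([ser1, ser2])

print(side_by_side.iloc[:5])
print(stacked.iloc[:7])
```

Because the original index labels are kept, `stacked` contains every label twice; passing `ignore_index=True` would renumber the rows instead.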

#### 4. (L2) Combine two pandas Series, excluding the values common to both.

In [8]:
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

union = np.union1d(ser1, ser2)
intersection = np.intersect1d(ser1, ser2)

print(union, intersection)
union[~pd.Series(union).isin(intersection)]

[1 2 3 4 5 6 7 8] [4 5]


array([1, 2, 3, 6, 7, 8])
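
The union-minus-intersection result above is the symmetric difference of the two Series. As an alternative sketch, NumPy's `np.setxor1d` computes it in a single call; the name `only_in_one` is illustrative.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# setxor1d returns the sorted values that appear in exactly one of the inputs
only_in_one = pd.Series(np.setxor1d(ser1, ser2))
print(only_in_one.tolist())
```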

#### 5. (L2) Extract the 25th and 75th percentile values of a pandas Series.

In [68]:
np.random.seed(0)
ser = pd.Series(np.random.normal(10, 5, 25))
ser.head()

0    18.820262
1    12.000786
2    14.893690
3    21.204466
4    19.337790
dtype: float64

In [69]:
for method in [pd.Series.max, pd.Series.min, pd.Series.median]:
    print(method(ser))
print(ser.quantile(0.25))
print(ser.quantile(0.75))

21.348773119938038
-2.7649490791703935
12.052992509691862
9.483905741032212
14.893689920528695


#### 7. (L1) Reshape a pandas Series into a 7x5 DataFrame.

In [21]:
np.random.seed(0) # Fix the random seed
ser = pd.Series(np.random.randint(1, 10, 35))
ser.head()

0    6
1    1
2    4
3    4
4    8
dtype: int64

In [22]:
pd.DataFrame(
    ser.to_numpy().reshape((7,5))
)

Unnamed: 0,0,1,2,3,4
0,6,1,4,4,8
1,4,6,3,5,8
2,7,9,9,2,7
3,8,8,9,2,6
4,9,5,4,1,4
5,6,1,3,4,9
6,2,4,4,4,8


In [27]:
pd.DataFrame(ser).head()

Unnamed: 0,0
0,6
1,1
2,4
3,4
4,8


#### 8. (L1) Stack two pandas Series vertically and horizontally.

In [31]:
ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))
print('[ser1]')
print(ser1, end='\n\n')
print('[ser2]')
print(ser2)

[ser1]
0    0
1    1
2    2
3    3
4    4
dtype: int64

[ser2]
0    a
1    b
2    c
3    d
4    e
dtype: object


In [None]:
### TODO - Start ###
horizontally = pd.?????(????, ????=??)
vertically = pd.?????(????, ????=??)
### TODO - End ###

In [17]:
print(vertically, horizontally)

0    0
1    1
2    2
3    3
4    4
0    a
1    b
2    c
3    d
4    e
dtype: object    0  1
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e


#### 9. (L2) Compute the mean squared error (MSE) between the true and predicted values.

**NOTE**: This exercise asks for the mean squared error between the two Series, using the formula:

$$\mathrm{MSE} = \dfrac{1}{n} \sum_{i=1}^{n} \left(truth_i - pred_i\right)^2$$

In [None]:
np.random.seed(0)
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

### TODO ###
mse = ????

In [None]:
mse

0.41319910235287544
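
A minimal sketch of one way to fill in the blank: the formula translates directly into element-wise Series arithmetic followed by `.mean()`. The setup is copied from the cell above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

# Element-wise error, squared, then averaged over the n observations
mse = ((truth - pred) ** 2).mean()
print(mse)
```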

#### 10. (L2) Convert the first character of each value to uppercase.

In [20]:
ser = pd.Series(['how', 'to', 'kick', 'ass?'])
ser.apply(str.title)

0     How
1      To
2    Kick
3    Ass?
dtype: object

#### 11. (L2) Compute the difference between each value of a pandas Series and the value before it.

In [14]:
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

print(ser.tolist())
print(ser.diff().tolist())
print(ser.diff().diff().tolist())

[1, 3, 6, 10, 15, 21, 27, 35]
[nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
[nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]


#### 12. (L2) Convert a pandas Series of date strings into datetimes.

In [33]:
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
ser

0         01 Jan 2010
1          02-02-2011
2            20120303
3          2013/04/04
4          2014-05-05
5    2015-06-06T12:20
dtype: object

In [None]:
### TODO ###
ser.astype(??)

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]
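
The blank above calls for a datetime dtype such as `'datetime64[ns]'`, as used in the next exercise. How a Series with mixed date formats is parsed in bulk can differ between pandas versions, so the sketch below takes a more conservative route and parses each string on its own with `pd.Timestamp`. The name `parsed` is illustrative.

```python
import pandas as pd

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303',
                 '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

# Parse every element separately so that no single format is assumed for the whole Series
parsed = pd.Series([pd.Timestamp(text) for text in ser], index=ser.index)
print(parsed)
```

Element-wise parsing is slower than a bulk conversion, which only matters for large inputs.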

#### 13. (L2) Extract the day of the month, week number, day of the year, and weekday from a pandas Series of dates.

In [None]:
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

# Manual
dates = []
week_numbers = []
day_numbers = []
weekdays = []
weekday_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

### TODO - Start ###
for index, date in enumerate(ser.astype('datetime64[ns]')):
    dates.append(??)
    week_numbers.append(date.??)
    day_numbers.append(date.day_of_year)
    weekdays.append(weekday_names[date.weekday()])
### TODO - End ###

print(dates)
print(week_numbers)
print(day_numbers)
print(weekdays)

[0, 1, 2, 3, 4, 5]
[53, 5, 9, 14, 19, 23]
[1, 33, 63, 94, 125, 157]
['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']
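
As an alternative to the manual loop, the `.dt` accessor exposes the same fields for a whole datetime Series at once. This is a sketch under one simplification: the dates are the same as above but written in a single format, and the time of day on the last entry is dropped.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2010-01-01', '2011-02-02', '2012-03-03',
                                  '2013-04-04', '2014-05-05', '2015-06-06']))

day_of_month = dates.dt.day.tolist()
week_numbers = dates.dt.isocalendar().week.tolist()  # ISO 8601 week numbers
day_numbers = dates.dt.dayofyear.tolist()
weekdays = dates.dt.day_name().tolist()

print(day_of_month)
print(week_numbers)
print(day_numbers)
print(weekdays)
```

No hand-written `weekday_names` list is needed here, because `day_name()` supplies the names.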


#### 14. (L2) Compute the mean of `weights` for each fruit in `fruit`.

In [41]:
np.random.seed(0)
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
weights.head()

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [38]:
fruits_dataframe = pd.DataFrame(
    np.vstack([fruit, weights])
).T
fruits_dataframe.head()

Unnamed: 0,0,1
0,banana,1.0
1,banana,2.0
2,banana,3.0
3,apple,4.0
4,carrot,5.0


In [39]:
fruits_dataframe.columns = ["fruit", "weight"]
fruits_dataframe.head()

Unnamed: 0,fruit,weight
0,banana,1.0
1,banana,2.0
2,banana,3.0
3,apple,4.0
4,carrot,5.0


In [18]:
fruits_dataframe.groupby('fruit').mean()

Unnamed: 0_level_0,weight
fruit,Unnamed: 1_level_1
apple,4.75
banana,5.4
carrot,9.0
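
Stacking the two Series with `np.vstack` produces object-dtype columns. A shorter route, sketched below, groups one Series by the other directly. The small fixed dataset is hypothetical and replaces the random draw above so that the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical data: five fruits with known weights
fruit = pd.Series(['apple', 'banana', 'apple', 'carrot', 'banana'])
weights = pd.Series([1.0, 2.0, 3.0, 4.0, 6.0])

# The values of `fruit` act as group keys for the aligned values of `weights`
mean_weights = weights.groupby(fruit).mean()
print(mean_weights)
```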


#### 15. (L3) Replace the spaces in a string with its least frequent character.

In [41]:
my_str = 'dbc deb abed gade'

uniques, counts = np.unique(
    pd.Series([*my_str]),
    return_counts=True
)

print(uniques)
print(counts)

[' ' 'a' 'b' 'c' 'd' 'e' 'g']
[3 2 3 1 4 3 1]


In [42]:
least_frequent_character = uniques[np.argmin(counts)]
least_frequent_character

'c'

In [None]:
### TODO ###
my_str.????(??, ??)

'dbccdebcabedcgade'
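
One possible completion of the blank is `str.replace`. The sketch below also leaves the space itself out of the count, which is an assumption made here so that the space can never be picked as the replacement; for this string the result is the same either way.

```python
import numpy as np

my_str = 'dbc deb abed gade'

# Count only the non-space characters; np.unique returns them in sorted order
uniques, counts = np.unique([ch for ch in my_str if ch != ' '], return_counts=True)
least_frequent_character = uniques[np.argmin(counts)]  # argmin picks the first one on ties

filled = my_str.replace(' ', least_frequent_character)
print(filled)
```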

#### 16. (L3) Create a DataFrame that covers 10 weeks starting from `2020-01-01` and holds a random number for each week.

In [40]:
np.random.seed(0)
pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=10, freq='7D'),
    'number': np.random.randint(low=0, high=100, size=10)
})

Unnamed: 0,date,number
0,2020-01-01,44
1,2020-01-08,47
2,2020-01-15,64
3,2020-01-22,67
4,2020-01-29,67
5,2020-02-05,9
6,2020-02-12,83
7,2020-02-19,21
8,2020-02-26,36
9,2020-03-04,87


#### 17. (L2) Extract a subset of the `boston` dataset.
- Read the data in chunks of 50 rows, then take the first row of each chunk to build a new DataFrame.

In [55]:
BOSTON_URL = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
boston = pd.read_csv(BOSTON_URL, chunksize=50)

In [56]:
chunk_list = [chunk.iloc[0,:] for chunk in boston]
chunk_list[0]

crim         0.00632
zn          18.00000
indus        2.31000
chas         0.00000
nox          0.53800
rm           6.57500
age         65.20000
dis          4.09000
rad          1.00000
tax        296.00000
ptratio     15.30000
b          396.90000
lstat        4.98000
medv        24.00000
Name: 0, dtype: float64

In [38]:
pd.DataFrame([
    chunk.iloc[0, :]  # Take the first row
    for chunk in pd.read_csv(
        BOSTON_URL,
        chunksize=50  # of every 50-row chunk
    )
]).reset_index(drop=True)

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.08873,21.0,5.64,0.0,0.439,5.963,45.7,6.8147,4.0,243.0,16.8,395.56,13.45,19.7
2,0.14866,0.0,8.56,0.0,0.52,6.727,79.9,2.7778,5.0,384.0,20.9,394.76,9.42,27.5
3,1.6566,0.0,19.58,0.0,0.871,6.122,97.3,1.618,5.0,403.0,14.7,372.8,14.1,21.5
4,0.01778,95.0,1.47,0.0,0.403,7.135,13.9,7.6534,3.0,402.0,17.0,384.3,4.45,32.9
5,0.1403,22.0,5.86,0.0,0.431,6.487,13.0,7.3967,7.0,330.0,19.1,396.28,5.9,24.4
6,0.04417,70.0,2.24,0.0,0.4,6.871,47.4,7.8278,5.0,358.0,14.8,390.86,6.07,24.8
7,0.06211,40.0,1.25,0.0,0.429,6.49,44.4,8.7921,1.0,335.0,19.7,396.9,5.98,22.9
8,25.0461,0.0,18.1,0.0,0.693,5.987,100.0,1.5888,24.0,666.0,20.2,396.9,26.77,5.6
9,6.71772,0.0,18.1,0.0,0.713,6.749,92.6,2.3236,24.0,666.0,20.2,0.32,17.44,13.4
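
When the whole file fits in memory, the same rows can be selected with a step slice instead of a chunked read. The sketch below compares both routes on a hypothetical in-memory CSV (`csv_text`, 120 rows) so that it runs without network access.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV file: 120 rows with two columns
csv_text = pd.DataFrame(
    {'row_id': np.arange(120), 'value': np.arange(120) * 2}
).to_csv(index=False)

# Chunked read: keep the first row of every 50-row chunk
first_rows = pd.DataFrame(
    [chunk.iloc[0] for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=50)]
).reset_index(drop=True)

# Same selection with a step slice after loading everything at once
every_50th = pd.read_csv(io.StringIO(csv_text)).iloc[::50].reset_index(drop=True)

print(first_rows)
print(every_50th)
```

The chunked version keeps at most 50 rows in memory at a time, which is the reason to prefer it for files that are too large to load whole.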


#### 18. (L2) Load the Boston housing data, replacing the median home value (`medv`) with 'Low' when it is below 25 and with 'High' when it is above 25.

In [None]:
def categorize(row: pd.Series):
    ### TODO ###
    ????
    return row

via_apply = pd.read_csv(BOSTON_URL).apply(categorize, axis=1) 
via_apply

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98,Low
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14,Low
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,High
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,High
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33,High
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67,Low
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08,Low
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64,Low
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48,Low
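
Row-wise `apply` calls the function once per row. A vectorized alternative, sketched here on a hypothetical three-row frame (`housing`) rather than the downloaded data, is `np.where`. The exercise does not say how a value of exactly 25 should be labelled, so treating it as 'High' is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the housing table; only `medv` matters here
housing = pd.DataFrame({'rm': [6.575, 6.421, 7.185], 'medv': [24.0, 21.6, 34.7]})

# Assumption: a value of exactly 25 is labelled 'High'
housing['medv'] = np.where(housing['medv'] < 25, 'Low', 'High')
print(housing)
```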


#### 19. (L2) Load the Boston housing data and inspect its summary information.

In [40]:
boston = pd.read_csv(BOSTON_URL)

print(
    boston.info(),
    '\n',
    boston.describe(),
    '\n',
    boston.to_numpy(),
    '\n',
    boston.to_numpy().tolist()
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  b        506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB
None 
              crim          zn       indus        chas         nox          rm  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean     3.613524   11.363636   11.136779    0.069170    0.554695 

#### 20. (L1) Return the 'Manufacturer', 'Model', and 'Type' of the most expensive car in the cars data.

In [57]:
CARS93_URL = "https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv"
cars = pd.read_csv(CARS93_URL)
cars.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,Max.Price,MPG.city,MPG.highway,AirBags,DriveTrain,...,Passengers,Length,Wheelbase,Width,Turn.circle,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,18.8,25.0,31.0,,Front,...,5.0,177.0,102.0,68.0,37.0,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,38.7,18.0,25.0,Driver & Passenger,Front,...,5.0,195.0,115.0,71.0,38.0,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,32.3,20.0,26.0,Driver only,Front,...,5.0,180.0,102.0,67.0,37.0,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,44.6,19.0,26.0,Driver & Passenger,,...,6.0,193.0,106.0,,37.0,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,,22.0,30.0,,Rear,...,4.0,186.0,109.0,69.0,39.0,27.0,13.0,3640.0,non-USA,BMW 535i


In [59]:
cars.loc[cars['Price'].argmax()].head()

Manufacturer    Mercedes-Benz
Model                    300E
Type                  Midsize
Min.Price                43.8
Price                    61.9
Name: 58, dtype: object

In [None]:
### TODO ###
cars.loc[cars['Price'].argmax(), ????]

Manufacturer    Mercedes-Benz
Model                    300E
Type                  Midsize
Name: 58, dtype: object
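
`argmax` returns a position, which matches the `.loc` label above only because the frame has the default 0..n-1 index. A sketch that does not depend on this uses `idxmax`, shown on a hypothetical three-row excerpt (`cars_small`) with a shuffled index; the column list is one possible completion of the blank.

```python
import pandas as pd

# Hypothetical three-row excerpt with a non-default index
cars_small = pd.DataFrame(
    {'Manufacturer': ['Acura', 'Mercedes-Benz', 'Audi'],
     'Model': ['Integra', '300E', '90'],
     'Type': ['Small', 'Midsize', 'Compact'],
     'Price': [15.9, 61.9, 29.1]},
    index=[10, 58, 2],
)

# idxmax returns the index label of the maximum, which is what .loc expects
top = cars_small.loc[cars_small['Price'].idxmax(), ['Manufacturer', 'Model', 'Type']]
print(top)
```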

#### 21. (L2) Rename the column `Type` to `CarType` in the cars data, and replace any `.` in the column names with `_`.

In [None]:
cars = pd.read_csv(CARS93_URL)

### TODO - Start ###
cars.columns = [
    column.??(??, ??)
    for column in cars.columns
]
cars = cars.??(columns={'Type': 'CarType'})
### TODO - End ###

cars.columns

Index(['Manufacturer', 'Model', 'CarType', 'Min_Price', 'Price', 'Max_Price',
       'MPG_city', 'MPG_highway', 'AirBags', 'DriveTrain', 'Cylinders',
       'EngineSize', 'Horsepower', 'RPM', 'Rev_per_mile', 'Man_trans_avail',
       'Fuel_tank_capacity', 'Passengers', 'Length', 'Wheelbase', 'Width',
       'Turn_circle', 'Rear_seat_room', 'Luggage_room', 'Weight', 'Origin',
       'Make'],
      dtype='object')
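
A possible solution sketch, shown on a hypothetical empty frame (`cars_small`) that carries only a few of the column names. It uses the vectorized `.str.replace` on the column index in place of the list comprehension, and assigns the result of `rename` back because `rename` returns a new frame.

```python
import pandas as pd

# Hypothetical frame with a few of the column names from the cars data
cars_small = pd.DataFrame(columns=['Manufacturer', 'Type', 'Min.Price', 'Rear.seat.room'])

# regex=False treats '.' as a literal dot and not as a regular-expression wildcard
cars_small.columns = cars_small.columns.str.replace('.', '_', regex=False)
cars_small = cars_small.rename(columns={'Type': 'CarType'})

print(cars_small.columns.tolist())
```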

#### 22. (L2) Count the NaN values in each column of the cars data.

In [64]:
cars = pd.read_csv(CARS93_URL)
cars.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,Max.Price,MPG.city,MPG.highway,AirBags,DriveTrain,...,Passengers,Length,Wheelbase,Width,Turn.circle,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,18.8,25.0,31.0,,Front,...,5.0,177.0,102.0,68.0,37.0,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,38.7,18.0,25.0,Driver & Passenger,Front,...,5.0,195.0,115.0,71.0,38.0,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,32.3,20.0,26.0,Driver only,Front,...,5.0,180.0,102.0,67.0,37.0,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,,37.7,44.6,19.0,26.0,Driver & Passenger,,...,6.0,193.0,106.0,,37.0,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,,30.0,,22.0,30.0,,Rear,...,4.0,186.0,109.0,69.0,39.0,27.0,13.0,3640.0,non-USA,BMW 535i


In [65]:
nans = cars.isna().astype(int).sum(axis=0)
nans.head()

Manufacturer    4
Model           1
Type            3
Min.Price       7
Price           2
dtype: int64

In [None]:
nans[
    nans == nans.max()
]

Luggage.room    19
dtype: int64

#### 23. (L2) Replace NaN values in `Min.Price` and `Max.Price` with the column mean.

In [None]:
cars = pd.read_csv(CARS93_URL)

for column in ['Min.Price', 'Max.Price']:
    # Assign the result back; fillna(inplace=True) on a selected column may not update `cars`
    cars[column] = cars[column].fillna(
        round(cars[column].mean(), 1)
    )

In [52]:
cars

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Price,Max.Price,MPG.city,MPG.highway,AirBags,DriveTrain,...,Passengers,Length,Wheelbase,Width,Turn.circle,Rear.seat.room,Luggage.room,Weight,Origin,Make
0,Acura,Integra,Small,12.9,15.9,18.8,25.0,31.0,,Front,...,5.0,177.0,102.0,68.0,37.0,26.5,,2705.0,non-USA,Acura Integra
1,,Legend,Midsize,29.2,33.9,38.7,18.0,25.0,Driver & Passenger,Front,...,5.0,195.0,115.0,71.0,38.0,30.0,15.0,3560.0,non-USA,Acura Legend
2,Audi,90,Compact,25.9,29.1,32.3,20.0,26.0,Driver only,Front,...,5.0,180.0,102.0,67.0,37.0,28.0,14.0,3375.0,non-USA,Audi 90
3,Audi,100,Midsize,17.1,37.7,44.6,19.0,26.0,Driver & Passenger,,...,6.0,193.0,106.0,,37.0,31.0,17.0,3405.0,non-USA,Audi 100
4,BMW,535i,Midsize,17.1,30.0,21.5,22.0,30.0,,Rear,...,4.0,186.0,109.0,69.0,39.0,27.0,13.0,3640.0,non-USA,BMW 535i
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88,Volkswagen,Eurovan,Van,16.6,19.7,22.7,17.0,21.0,,Front,...,7.0,187.0,115.0,72.0,38.0,34.0,,3960.0,,Volkswagen Eurovan
89,Volkswagen,Passat,Compact,17.6,20.0,22.4,21.0,30.0,,Front,...,5.0,180.0,103.0,67.0,35.0,31.5,14.0,2985.0,non-USA,Volkswagen Passat
90,Volkswagen,Corrado,Sporty,22.9,23.3,23.7,18.0,25.0,,Front,...,4.0,159.0,97.0,66.0,36.0,26.0,15.0,2810.0,non-USA,Volkswagen Corrado
91,Volvo,240,Compact,21.8,22.7,23.5,21.0,28.0,Driver only,Rear,...,5.0,190.0,104.0,67.0,37.0,29.5,14.0,2985.0,non-USA,Volvo 240


#### 24. (L1) Display only the `Manufacturer`, `Model`, and `Type` columns, for every 20th row.

In [None]:
dataframe = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

### TODO ###
dataframe[????].iloc[::20]

Unnamed: 0,Manufacturer,Model,Type
0,Acura,Integra,Small
20,Chrysler,LeBaron,Compact
40,Honda,Prelude,Sporty
60,Mercury,Cougar,Midsize
80,Subaru,Loyale,Small


#### 25. (L2) Replace NaN values with 'missing', then join 'Manufacturer', 'Model', and 'Type' to form a primary key.

Desired output:

```python
                       Manufacturer    Model     Type  Min.Price  Max.Price
Acura_Integra_Small           Acura  Integra    Small       12.9       18.8
missing_Legend_Midsize      missing   Legend  Midsize       29.2       38.7
Audi_90_Compact                Audi       90  Compact       25.9       32.3
Audi_100_Midsize               Audi      100  Midsize        NaN       44.6
BMW_535i_Midsize                BMW     535i  Midsize        NaN        NaN
```

In [66]:
dataframe = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', usecols=[0,1,2,3,5])
dataframe.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Max.Price
0,Acura,Integra,Small,12.9,18.8
1,,Legend,Midsize,29.2,38.7
2,Audi,90,Compact,25.9,32.3
3,Audi,100,Midsize,,44.6
4,BMW,535i,Midsize,,


In [67]:
dataframe[['Manufacturer', 'Model', 'Type']] = dataframe[['Manufacturer', 'Model', 'Type']].fillna(value='missing')
dataframe.head()

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Max.Price
0,Acura,Integra,Small,12.9,18.8
1,missing,Legend,Midsize,29.2,38.7
2,Audi,90,Compact,25.9,32.3
3,Audi,100,Midsize,,44.6
4,BMW,535i,Midsize,,


In [54]:
dataframe.set_index(
    dataframe['Manufacturer'] + '_' + dataframe['Model'] + '_' + dataframe['Type']
)

Unnamed: 0,Manufacturer,Model,Type,Min.Price,Max.Price
Acura_Integra_Small,Acura,Integra,Small,12.9,18.8
missing_Legend_Midsize,missing,Legend,Midsize,29.2,38.7
Audi_90_Compact,Audi,90,Compact,25.9,32.3
Audi_100_Midsize,Audi,100,Midsize,,44.6
BMW_535i_Midsize,BMW,535i,Midsize,,
...,...,...,...,...,...
Volkswagen_Eurovan_Van,Volkswagen,Eurovan,Van,16.6,22.7
Volkswagen_Passat_Compact,Volkswagen,Passat,Compact,17.6,22.4
Volkswagen_Corrado_Sporty,Volkswagen,Corrado,Sporty,22.9,23.7
Volvo_240_Compact,Volvo,240,Compact,21.8,23.5


#### 26. (L2) Return the row that holds the 5th largest value in column 'a'.

In [39]:
np.random.seed(0)
dataframe = pd.DataFrame(np.random.randint(1, 30, 30).reshape(10,-1), columns=list('abc'))

np.argsort(  # Positions that sort column 'a' in ascending order
    dataframe['a'].to_numpy()
)[::-1][4]  # Reverse to descending order and take the 5th element (position 4)

3
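
A pandas-only sketch of the same lookup uses `nlargest`. The ten values below are hypothetical and written out explicitly so that the result can be checked by hand.

```python
import pandas as pd

# Hypothetical column of ten values
dataframe = pd.DataFrame({'a': [13, 1, 4, 20, 5, 25, 27, 8, 25, 26]})

# nlargest(5) keeps the five largest values in descending order,
# so the last of them is the 5th largest
fifth_largest_label = dataframe['a'].nlargest(5).index[-1]
row = dataframe.loc[fifth_largest_label]

print(fifth_largest_label)
print(row)
```

Here the five largest values are 27, 26, 25, 25, and 20, so the label returned is that of the row holding 20.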

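#### 27. (L2) Cap the values of a pandas Series: replace values below the 5th percentile with the 5th percentile value and values above the 95th percentile with the 95th percentile value (outlier handling).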

In [57]:
ser = pd.Series(np.logspace(-2, 2, 30))

ser[
    ser < ser.quantile(0.05)
] = ser.quantile(0.05)
ser[
    ser > ser.quantile(0.95)
] = ser.quantile(0.95)

ser[:5]

0    0.016049
1    0.016049
2    0.018874
3    0.025929
4    0.035622
dtype: float64
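
The two masked assignments can also be written as a single call to `Series.clip`. In this sketch both bounds are computed from the untouched data first; the names `low`, `high`, and `capped` are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.logspace(-2, 2, 30))

# Compute both bounds from the original data, then cap in one step
low, high = ser.quantile([0.05, 0.95])
capped = ser.clip(lower=low, upper=high)

print(capped[:5])
```

Unlike the in-place assignments, `clip` returns a new Series and leaves `ser` unchanged.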

#### 28. (L3) Treating each row as a point in 4 dimensions, find the name of the nearest row for every row, along with the distance to it.

In [35]:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1,100,40).reshape(10, -1), columns=list('pqrs'), index=list('abcdefghij'))
df.head()

Unnamed: 0,p,q,r,s
a,45,48,65,68
b,68,10,84,22
c,37,88,71,89
d,89,13,59,66
e,40,88,47,89


In [36]:
euclidean_distances = {
    'nearest_row': [],
    'dist': []
}

In [37]:
df.shape

(10, 4)

In [38]:
for first_row in range(0, df.shape[0]): # For each row
    distances = []
    ids = []
    for second_row in range(0, df.shape[0]): # Compute the distance to every row
        if first_row == second_row: # Skip the row itself
            continue
        euclidean_distance = np.linalg.norm(
            (df.iloc[second_row] - df.iloc[first_row])
        )
        distances.append(euclidean_distance) # Store the distance
        ids.append(df.iloc[second_row].name) # Store the row name

    series = pd.Series(distances, index=ids)
    euclidean_distances['nearest_row'].append(
        series.index[series.argmin()]
    )
    euclidean_distances['dist'].append(
        series.min()
    )
    
distances_df = pd.DataFrame(euclidean_distances, index=df.index)
distances_df

Unnamed: 0,nearest_row,dist
a,h,44.124823
b,d,54.87258
c,e,24.186773
d,f,43.669211
e,c,24.186773
f,g,29.983329
g,f,29.983329
h,i,38.457769
i,h,38.457769
j,g,68.014704


In [None]:
### TODO ###
pd.????([df, distances_df], ??=??)

Unnamed: 0,p,q,r,s,nearest_row,dist
a,45,48,65,68,h,44.124823
b,68,10,84,22,d,54.87258
c,37,88,71,89,e,24.186773
d,89,13,59,66,f,43.669211
e,40,88,47,89,c,24.186773
f,82,38,26,78,g,29.983329
g,73,10,21,81,f,29.983329
h,70,80,48,65,i,38.457769
i,83,89,50,30,h,38.457769
j,20,20,15,40,g,68.014704
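
The nested loop above can be replaced by NumPy broadcasting, which computes all pairwise distances in one array. This is a sketch on a hypothetical four-point frame (`points`) in two dimensions; the same code applies to any number of columns.

```python
import numpy as np
import pandas as pd

# Hypothetical points, small enough to verify by hand
points = pd.DataFrame(
    [[0, 0], [3, 4], [3, 5], [10, 10]],
    index=list('abcd'), columns=['x', 'y'],
)

values = points.to_numpy(dtype=float)
# diff[i, j] is the vector from row j to row i; shape (n, n, dims)
diff = values[:, None, :] - values[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)  # a row must not be its own nearest neighbour

nearest = pd.DataFrame(
    {'nearest_row': points.index[dist.argmin(axis=1)].tolist(),
     'dist': dist.min(axis=1)},
    index=points.index,
)
print(pd.concat([points, nearest], axis=1))
```

The intermediate array grows with the square of the number of rows, so this trades memory for speed.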


#### 29. (L2) Compute the ratio of the minimum to the maximum value for each row of the generated data.

In [12]:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,45,48,65,68,68,10,84,22,37,88
1,71,89,89,13,59,66,40,88,47,89
2,82,38,26,78,73,10,21,81,70,80
3,48,65,83,89,50,30,20,20,15,40
4,33,66,10,58,33,32,75,24,36,76


In [None]:
df.apply(lambda row: row.min()/row.max(), axis=1)

0    0.113636
1    0.146067
2    0.121951
3    0.168539
4    0.131579
5    0.017857
6    0.025000
7    0.010870
dtype: float64

#### 30. (L2) Normalize the data column by column so that the minimum and maximum of each column become 0 and 1, respectively.

In [19]:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,45,48,65,68,68,10,84,22,37,88
1,71,89,89,13,59,66,40,88,47,89
2,82,38,26,78,73,10,21,81,70,80
3,48,65,83,89,50,30,20,20,15,40
4,33,66,10,58,33,32,75,24,36,76


In [None]:
df.apply(lambda col: (col - col.min()) / (col.max() - col.min())).round(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.47,0.51,0.7,0.74,0.93,0.12,1.0,0.2,0.4,0.99
1,0.84,1.0,1.0,0.13,0.81,1.0,0.31,1.0,0.58,1.0
2,1.0,0.39,0.2,0.85,1.0,0.12,0.02,0.91,1.0,0.87
3,0.51,0.71,0.92,0.97,0.68,0.44,0.0,0.17,0.0,0.31
4,0.3,0.73,0.0,0.63,0.44,0.47,0.86,0.22,0.38,0.82
5,0.63,0.29,0.32,0.0,0.0,0.55,0.53,0.0,0.44,0.0
6,0.97,0.0,0.42,0.64,0.43,0.0,0.72,0.44,0.78,0.25
7,0.0,0.5,0.92,1.0,0.0,0.2,0.53,0.09,0.51,0.94


#### 31. (L2) Replace the values on both diagonals with 0.

In [21]:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1,100, 100).reshape(10, -1))

eye = pd.DataFrame(np.eye(*df.shape))
eye.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df[
    eye != 1.0
][
    pd.DataFrame(np.rot90(eye)) != 1.0
].fillna(value=0)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,48.0,65.0,68.0,68.0,10.0,84.0,22.0,37.0,0.0
1,71.0,0.0,89.0,13.0,59.0,66.0,40.0,88.0,0.0,89.0
2,82.0,38.0,0.0,78.0,73.0,10.0,21.0,0.0,70.0,80.0
3,48.0,65.0,83.0,0.0,50.0,30.0,0.0,20.0,15.0,40.0
4,33.0,66.0,10.0,58.0,0.0,0.0,75.0,24.0,36.0,76.0
5,56.0,29.0,35.0,1.0,0.0,0.0,54.0,6.0,39.0,18.0
6,80.0,5.0,43.0,0.0,32.0,2.0,0.0,42.0,58.0,36.0
7,12.0,47.0,0.0,92.0,1.0,15.0,54.0,0.0,43.0,85.0
8,76.0,0.0,7.0,69.0,48.0,4.0,77.0,53.0,0.0,16.0
9,0.0,59.0,24.0,80.0,14.0,86.0,49.0,50.0,70.0,0.0
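
Masking and `fillna` turn the integers into floats, as the output above shows. A sketch that keeps the integer dtype writes the zeros directly with `np.fill_diagonal`; the 4x4 frame is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 17).reshape(4, 4))  # hypothetical 4x4 frame

values = df.to_numpy().copy()
np.fill_diagonal(values, 0)             # main diagonal
np.fill_diagonal(np.fliplr(values), 0)  # fliplr is a view, so this writes the anti-diagonal
zeroed = pd.DataFrame(values, index=df.index, columns=df.columns)

print(zeroed)
```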


#### 32. (L2) Group the data by the first column (fruit name).

In [23]:
np.random.seed(0)
df = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 3,
                   'col2': np.random.rand(9),
                   'col3': np.random.randint(0, 15, 9)})

In [None]:
df_grouped = df.groupby('col1')
df_grouped.get_group('apple')

Unnamed: 0,col1,col2,col3
0,apple,0.548814,10
3,apple,0.544883,7
6,apple,0.437587,8


#### 33. (L2) Print the second-highest price for bananas.

In [25]:
np.random.seed(0)
df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'rating': np.random.rand(9),
                   'price': np.random.randint(0, 15, 9)})
df.head()

Unnamed: 0,fruit,rating,price
0,apple,0.548814,10
1,banana,0.715189,1
2,orange,0.602763,6
3,apple,0.544883,7
4,banana,0.423655,7


In [None]:
df.groupby('fruit').get_group('banana')['price'].sort_values(ascending=False).values[1]

1

#### 34. (L2) Merge two DataFrames.

In [29]:
np.random.seed(0)
df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.random.randint(0, 15, 9)})
df1.head()

Unnamed: 0,fruit,weight,price
0,apple,high,12
1,banana,medium,5
2,orange,low,0
3,apple,high,3
4,banana,medium,11


In [30]:
np.random.seed(0)
df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
                    'kilo': ['high', 'low'] * 3,
                    'price': np.random.randint(0, 15, 6)})
df2.head()

Unnamed: 0,pazham,kilo,price
0,apple,high,12
1,orange,low,5
2,pine,high,0
3,apple,low,3
4,orange,high,11


In [None]:
df1.merge(df2, left_on=['fruit', 'weight'], right_on=['pazham', 'kilo'], how='inner', suffixes=['_df1', '_df2'])

Unnamed: 0,fruit,weight,price_df1,pazham,kilo,price_df2
0,apple,high,12,apple,high,12
1,apple,high,3,apple,high,12
2,apple,high,7,apple,high,12
3,orange,low,0,orange,low,5
4,orange,low,3,orange,low,5
5,orange,low,3,orange,low,5


#### 35. (L2) Split a single combined column into three columns.

In [32]:
df = pd.DataFrame(["STD, City State",
"33, Kolkata Bengal",
"44, Chennai Nadu",
"40, Hyderabad Telengana",
"80, Bangalore Karnataka"], columns=['row'])
df.head()

Unnamed: 0,row
0,"STD, City State"
1,"33, Kolkata Bengal"
2,"44, Chennai Nadu"
3,"40, Hyderabad Telengana"
4,"80, Bangalore Karnataka"


In [None]:
split = df.row.str.replace(',', '', regex=False).str.split(' ', expand=True)
split.columns = split.iloc[0]
split = split[1:]
split

Unnamed: 0,STD,City,State
1,33,Kolkata,Bengal
2,44,Chennai,Nadu
3,40,Hyderabad,Telengana
4,80,Bangalore,Karnataka
