# Job Posting Recommendation  
Build a binary classifier that predicts whether a developer will apply to a given job posting.  
#### ※ Notice  
Code copyright holder: the author  
Data copyright holder: Grepp (그렙)  


In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

In [2]:
# Load the data
df_train=pd.read_csv('./data/train.csv')
df_job_tags=pd.read_csv('./data/job_tags.csv')
df_user_tags=pd.read_csv('./data/user_tags.csv')
df_tags=pd.read_csv('./data/tags.csv')
df_job_companies=pd.read_csv('./data/job_companies.csv')
df_test=pd.read_csv('./data/test.csv')

Inspect each table to understand its characteristics.

In [3]:
# train.csv: user, posting the user viewed, whether the user applied
print(f"Users who viewed postings : {len(df_train.userID.unique())}")
print(f"Distinct postings viewed  : {len(df_train.jobID.unique())}")
df_train.tail(3)

Users who viewed postings : 196
Distinct postings viewed  : 708


Unnamed: 0,userID,jobID,applied
5997,3ab88dd28f749fe4ec90c0b6f9896eb5,e2a2dcc36a08a345332c751b2f2e476c,0
5998,75b4af0dacbc119eadf4eeb096738405,3b712de48137572f3849aabd5666a4e3,0
5999,67adefb430df142b099bed89bd491524,65cc2c8205a05d7379fa3a6386f710e1,0


In [4]:
# job_tags.csv: posting, skill tag
print(f"Distinct required skills : {len(df_job_tags.tagID.unique())}")
df_job_tags.tail(3)

Distinct required skills : 240


Unnamed: 0,jobID,tagID
3474,6c8dba7d0df1c4a79dd07646be9a26c8,0e9fa1f3e9e66792401a6972d477dcc3
3475,6c8dba7d0df1c4a79dd07646be9a26c8,0c048b3a434e49e655c1247efb389cec
3476,9f36407ead0629fc166f14dde7970f68,c4851e8e264415c4094e4e85b0baa7cc


In [5]:
# user_tags.csv: user, skill tag
print(f"Users with skills    : {len(df_user_tags.userID.unique())}")  # Every user has at least one skill.
print(f"Distinct skills held : {len(df_user_tags.tagID.unique())}")
df_user_tags.tail(3)

Users with skills    : 196
Distinct skills held : 345


Unnamed: 0,userID,tagID
17191,3ab88dd28f749fe4ec90c0b6f9896eb5,f47330643ae134ca204bf6b2481fec47
17192,15d84e9a5eceb67bcb8fb0e8c839a903,285f89b802bcb2651801455c86d78f2a
17193,3fb6224c45e07abd01a213f707d2219b,7d771e0e8f3633ab54856925ecdefc5d


In [6]:
# tags.csv: every existing skill tag ID and the tag's actual name

print(f"Distinct tag keywords : {len(df_tags.keyword.unique())}")
print(f"Distinct tag IDs      : {len(df_tags.tagID.unique())}")
df_tags.head(3)

Distinct tag keywords : 887
Distinct tag IDs      : 887


Unnamed: 0,tagID,keyword
0,602d1305678a8d5fdb372271e980da6a,Amazon Web Services(AWS)
1,e3251075554389fe91d17a794861d47b,Tensorflow
2,a1d50185e7426cbb0acad1e6ca74b9aa,Docker


In [7]:
# job_companies.csv: company, posting published by the company, company size
# (the raw value '1000 이상' means '1000 or more')

print(f"Distinct companies      : {len(df_job_companies.companyID.unique())}")
print('Company size categories : {}'.format(df_job_companies.companySize.unique()))
print(f"Total distinct postings : {len(df_job_companies.jobID.unique())}")
df_job_companies.tail(3)

Distinct companies      : 276
Company size categories : [nan '11-50' '101-200' '1-10' '51-100' '1000 이상' '201-500' '501-1000']
Total distinct postings : 733


Unnamed: 0,companyID,jobID,companySize
730,443cb001c138b2561a0d90720d6ce111,d81f9c1be2e08964bf9f24b15f0e4900,
731,b5b41fac0361d157d9673ecb926af5ae,ae0eb3eed39d2bcef4622b2499a05fe6,
732,64223ccf70bbb65a3a4aceac37e21016,912d2b1c7b2826caf99687388d2e8f7c,1-10


In [8]:
taglist=pd.concat([df_job_tags.tagID, df_user_tags.tagID])
taglist=taglist.unique()
print(f"Skills actually in use : {len(taglist)}")

Skills actually in use : 419


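#### Findings
- No new users or new postings appear; the sets of users and postings are fixed.  
- Many of the defined skills are neither held by any user nor required by any posting: only 419 of the 887 tags are actually in use.  
- There is no need to map tag IDs back to their actual names.  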
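
The first finding can be checked directly with set differences. The sketch below uses small made-up tables (`train`, `test`, and `job_companies` here are hypothetical stand-ins for the CSV files); on the real data both sets should come out empty if the finding holds.

```python
import pandas as pd

# Hypothetical toy tables standing in for train.csv, test.csv and job_companies.csv.
train = pd.DataFrame({"userID": ["u1", "u2"], "jobID": ["j1", "j2"]})
test = pd.DataFrame({"userID": ["u2", "u1"], "jobID": ["j3", "j1"]})
job_companies = pd.DataFrame({"jobID": ["j1", "j2", "j3"]})

# Users in the test split that never appear in the training split.
unseen_users = set(test["userID"]) - set(train["userID"])

# Postings in either split that are missing from the company table.
seen_jobs = pd.concat([train["jobID"], test["jobID"]])
unseen_jobs = set(seen_jobs) - set(job_companies["jobID"])

print(unseen_users, unseen_jobs)
```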
  
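#### What to do next  
1. Build a DataFrame holding each user ID and the user's skills (419 kinds, one-hot encoded).  
2. Build a DataFrame holding each job ID, its required skills, and the company size.  
3. Preprocess each (user, posting) pair by adding [held skills, required skills, company size] and dropping [user ID, job ID], then train on the result.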
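
The three steps can be sketched on toy data as follows. This is an illustrative alternative rather than the notebook's own code: it assumes `pd.crosstab` in place of a row-by-row loop, and every table and column prefix here (`user_tags`, `job_sizes`, `pairs`, `user_`, `job_`) is made up for the example.

```python
import pandas as pd

# Hypothetical toy data; the real tables come from the CSV files loaded earlier.
user_tags = pd.DataFrame({"userID": ["u1", "u1", "u2"], "tagID": ["t1", "t1", "t2"]})
job_tags = pd.DataFrame({"jobID": ["j1", "j2"], "tagID": ["t2", "t3"]})
job_sizes = pd.DataFrame({"jobID": ["j1", "j2"], "companySize": ["1-10", None]})
pairs = pd.DataFrame({"userID": ["u2", "u1"], "jobID": ["j1", "j2"]})

all_tags = sorted(set(user_tags["tagID"]) | set(job_tags["tagID"]))

# Step 1: one row per user, one column per tag; duplicate rows become counts.
user_feat = (pd.crosstab(user_tags["userID"], user_tags["tagID"])
             .reindex(columns=all_tags, fill_value=0)
             .add_prefix("user_"))

# Step 2: one row per posting; tags clipped to 0/1, plus company-size dummies.
job_feat = (pd.crosstab(job_tags["jobID"], job_tags["tagID"])
            .clip(upper=1)
            .reindex(columns=all_tags, fill_value=0)
            .add_prefix("job_"))
size_feat = pd.get_dummies(job_sizes.set_index("jobID")["companySize"],
                           dummy_na=True, dtype=int)
job_feat = job_feat.join(size_feat)

# Step 3: attach both feature blocks to each (user, posting) pair, then drop the IDs.
x = (pairs.merge(user_feat, left_on="userID", right_index=True, how="left")
          .merge(job_feat, left_on="jobID", right_index=True, how="left")
          .drop(columns=["userID", "jobID"]))
print(x.shape)
```

In this toy version a user or posting with no tag rows would be absent from the crosstab and would surface as missing values after the left merge, so real data may need an extra `fillna(0)` step.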

In [9]:
# Build per-user skill features
df_user=pd.DataFrame(df_train.userID.unique())
df_user.columns=["userID"]
df_user[taglist]=pd.DataFrame([[0]*len(taglist)]*len(df_user))

# user_tags contains many duplicate (user, tag) rows.
# Can a duplicate be read as stronger interest in that skill?
# I tried both dropping and keeping duplicates; keeping them scored slightly (0.1%p) better.
#df_user_tags=df_user_tags.drop_duplicates().reset_index(drop=True)  # uncomment to drop duplicates
for _, row in df_user_tags.iterrows():
    df_user.loc[df_user.userID==row.userID, row.tagID]+=1

df_user

Unnamed: 0,userID,d38901788c533e8286cb6400b40b386d,3948ead63a9f2944218de038d8934305,0e095e054ee94774d6a496099eb1cf6a,7d771e0e8f3633ab54856925ecdefc5d,6c8dba7d0df1c4a79dd07646be9a26c8,4da04049a062f5adfe81b67dd755cecc,39e4973ba3321b80f37d9b55f63ed8b8,8a3363abe792db2d8761d6403605aeb7,a1d50185e7426cbb0acad1e6ca74b9aa,...,959a557f5f6beb411fd954f3f34b21c3,3d2d8ccb37df977cb6d9da15b76c3f3a,291597a100aadd814d197af4f4bab3a7,67e103b0761e60683e83c559be18d40c,fae0b27c451c728867a567e8c1bb4e53,5fa9e41bfec0725742cc9d15ef594120,07563a3fe3bbe7e3ba84431ad9d055af,515ab26c135e92ed8bf3a594d67e4ade,d82118376df344b0010f53909b961db3,9ad6aaed513b73148b7d49f70afcfb32
0,fe292163d06253b716e9a0099b42031d,0,12,0,0,0,0,0,0,12,...,0,0,0,0,0,0,0,0,0,0
1,6377fa90618fae77571e8dc90d98d409,0,0,5,0,0,0,5,0,0,...,0,0,0,0,0,0,0,0,0,0
2,8ec0888a5b04139be0dfe942c7eb4199,0,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,f862b39f767d3a1991bdeb2ea1401c9c,0,5,5,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,cac14930c65d72c16efac2c51a6b7f71,0,0,5,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
191,66480bd2955f9663eff79f679c096733,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
192,570e72724ec4b76760248ccdae0449f8,0,0,11,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
193,ac9e4248f16d319a00b803477db2433a,0,5,5,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
194,015b469419f616144c13e0194f880af7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
# Build per-posting features: required skills and company size
df_job=pd.DataFrame(df_job_companies.jobID.unique())
df_job.columns=["jobID"]
df_job[taglist]=pd.DataFrame([[0]*len(taglist)]*len(df_job))

for _, row in df_job_tags.iterrows():
    df_job.loc[df_job.jobID==row.jobID, row.tagID]=1

# One-hot encode company size as well (missing sizes get their own NaN column).
# Rows line up with df_job_companies because every jobID there is unique (733 rows, 733 IDs).
df_job["companySize"]=df_job_companies.companySize
dummies=pd.get_dummies(df_job["companySize"], dummy_na=True)
df_job=pd.concat([df_job, dummies], axis=1)
df_job=df_job.drop("companySize", axis=1)

df_job

Unnamed: 0,jobID,d38901788c533e8286cb6400b40b386d,3948ead63a9f2944218de038d8934305,0e095e054ee94774d6a496099eb1cf6a,7d771e0e8f3633ab54856925ecdefc5d,6c8dba7d0df1c4a79dd07646be9a26c8,4da04049a062f5adfe81b67dd755cecc,39e4973ba3321b80f37d9b55f63ed8b8,8a3363abe792db2d8761d6403605aeb7,a1d50185e7426cbb0acad1e6ca74b9aa,...,d82118376df344b0010f53909b961db3,9ad6aaed513b73148b7d49f70afcfb32,1-10,1000 이상,101-200,11-50,201-500,501-1000,51-100,NaN
0,e5f6ad6ce374177eef023bf5d0c018b6,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,185e65bc40581880c4f2c82958de8cfe,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0537fb40a68c18da59a35c2bfe1ca554,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,b7ee6f5f9aa5cd17ca1aea43ce848496,0,0,1,1,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,efe937780e95574250dabe07151bdc23,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
728,fa3a3c407f82377f55c19c5d403335c7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
729,d7a728a67d909e714c0774e22cb806f2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
730,d81f9c1be2e08964bf9f24b15f0e4900,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
731,ae0eb3eed39d2bcef4622b2499a05fe6,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [11]:
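# Machine learning model (MLP)
# input_dim=846: 419 user skill columns + 419 job skill columns + 8 company-size columns

model=tf.keras.Sequential([
    tf.keras.layers.Dense(500, input_dim=846, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss='binary_crossentropy', metrics=['accuracy'])
# Monitors val_loss by default; stops after 10 epochs without improvement and restores the best weights
early_stopping=tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)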

In [12]:
# Preprocess the training data
# how="left" keeps the row order of df_train, so x_train stays aligned with y_train
x_train=df_train.iloc[:,:2]
x_train=x_train.merge(df_user, on="userID", how="left").merge(df_job, on="jobID", how="left").iloc[:,2:]
y_train=df_train.applied

# Rows 0-4999: train, rows 5000-5999: validation

In [13]:
# Train the model
model.fit(x_train[:5000], y_train[:5000], validation_data=(x_train[5000:], y_train[5000:]), callbacks=[early_stopping]
         , batch_size=50, epochs=100, verbose=2)

Epoch 1/100
100/100 - 1s - loss: 0.5760 - accuracy: 0.8336 - val_loss: 0.5026 - val_accuracy: 0.8590
Epoch 2/100
100/100 - 0s - loss: 0.4779 - accuracy: 0.8464 - val_loss: 0.4706 - val_accuracy: 0.8520
Epoch 3/100
100/100 - 0s - loss: 0.4591 - accuracy: 0.8486 - val_loss: 0.4455 - val_accuracy: 0.8570
Epoch 4/100
100/100 - 0s - loss: 0.4400 - accuracy: 0.8496 - val_loss: 0.4774 - val_accuracy: 0.8520
Epoch 5/100
100/100 - 0s - loss: 0.4365 - accuracy: 0.8502 - val_loss: 0.4809 - val_accuracy: 0.8460
Epoch 6/100
100/100 - 0s - loss: 0.4334 - accuracy: 0.8520 - val_loss: 0.4533 - val_accuracy: 0.8540
Epoch 7/100
100/100 - 0s - loss: 0.4087 - accuracy: 0.8548 - val_loss: 0.4635 - val_accuracy: 0.8550
Epoch 8/100
100/100 - 0s - loss: 0.3994 - accuracy: 0.8590 - val_loss: 0.4594 - val_accuracy: 0.8470
Epoch 9/100
100/100 - 0s - loss: 0.4009 - accuracy: 0.8584 - val_loss: 0.4819 - val_accuracy: 0.8550
Epoch 10/100
100/100 - 0s - loss: 0.3972 - accuracy: 0.8602 - val_loss: 0.4950 - val_accura

<keras.callbacks.History at 0x213a43e2f60>
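
The `EarlyStopping` callback is configured with `patience=10`. As a rough illustration of what that means for the log shown, the following pure-Python sketch replays the ten logged `val_loss` values through a simplified patience counter. `early_stopping_trace` is a hypothetical helper written for this note, not part of Keras, and it assumes the callback's defaults (monitoring `val_loss`, no minimum delta).

```python
def early_stopping_trace(val_losses, patience):
    """Return (best_epoch, stop_epoch) using 1-based epoch numbers.

    stop_epoch is None if the patience budget is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Strict improvement resets the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None


# val_loss values copied from the training log, epochs 1 through 10.
logged = [0.5026, 0.4706, 0.4455, 0.4774, 0.4809,
          0.4533, 0.4635, 0.4594, 0.4819, 0.4950]
best_epoch, stop_epoch = early_stopping_trace(logged, patience=10)
print(best_epoch, stop_epoch)
```

With these values the lowest validation loss is at epoch 3, and a patience of 10 would not trigger a stop before epoch 13 at the earliest.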

In [14]:
# Preprocess the test data
# how="left" keeps the row order of df_test, so predictions line up with the input rows
x_test=df_test.iloc[:,:2]
x_test=x_test.merge(df_user, on="userID", how="left").merge(df_job, on="jobID", how="left").iloc[:,2:]

In [15]:
# Predict whether each user applies to each posting
y_pred=pd.DataFrame(model.predict(x_test))
y_pred["applied"]=y_pred[0].round().astype(int)
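
Calling `round()` on a sigmoid output amounts to thresholding at 0.5. A minimal sketch with made-up probabilities (`probs` is hypothetical) shows the one edge case where rounding and an explicit comparison differ; exact 0.5 outputs are rare in practice, so this mainly matters if the threshold is later tuned to something other than 0.5.

```python
import numpy as np

# Hypothetical sigmoid outputs, including one exactly at the decision boundary.
probs = np.array([0.12, 0.5, 0.87])

rounded = np.round(probs).astype(int)     # round-half-to-even: 0.5 becomes 0
thresholded = (probs >= 0.5).astype(int)  # explicit rule: 0.5 becomes 1

print(rounded, thresholded)
```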

In [16]:
# Export the predictions to a file
y_pred.drop([0], axis=1).to_csv("./data/applied_pred.csv", index=False)