## Naver Trend Crawling

### Purpose of the crawl
#### Compare each category's top 20 search keywords, taken over a 7-day window, against the product name ('상품명') using NLP

### Classification criteria
#### 1. Category
#### 2. Gender: both male and female
#### 3. Age group (40s, 50s, 60s)

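### Crawling plan
#### 1. Use a 7-day window: the previous 6 days plus the current day
#### 2. Crawl the top 20 keywords for each window
#### 3. Build a list of keywords for each day
#### 4. Use NLP to compute the distance between each day's keywords and the product name ('상품명')
#### 5. Use that distance as a new feature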
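
Steps 4 and 5 do not say which distance is used. As a minimal sketch of one option, the block below computes a Jaccard distance over character bigrams and takes the smallest distance to any trending keyword as the feature value. The function names, the choice of bigrams, and the sample product name are assumptions made for illustration; an embedding-based distance could be substituted in the same place.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of `text`, ignoring whitespace."""
    compact = "".join(text.split())
    if len(compact) < n:
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard_distance(a, b, n=2):
    """1 - |A & B| / |A | B| over character n-grams (0.0 = identical sets)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return 1.0 - len(grams_a & grams_b) / len(union)


def min_keyword_distance(product_name, keywords, n=2):
    """Smallest distance between a product name and any trending keyword."""
    if not keywords:
        return 1.0  # no keywords: treat the product as maximally distant
    return min(jaccard_distance(product_name, kw, n) for kw in keywords)


# Hypothetical product name; the keywords are taken from the crawl output
keywords = ["롱패딩", "여성롱패딩", "니트원피스"]
print(min_keyword_distance("여성 롱패딩 겨울 아우터", keywords))
```

The returned value lies between 0 and 1, so it can be added to a feature table without further scaling.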


In [199]:
import requests
import pandas as pd
from datetime import datetime, timedelta

In [200]:
url = "https://datalab.naver.com/shoppingInsight/getCategoryKeywordRank.naver"
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
    "x-requested-with": "XMLHttpRequest"
}
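data = {
    "cid": "50000000",          # category ID
    "timeUnit": "date",
    "startDate": "2019-01-01",  # window start (overwritten in the loops below)
    "endDate": "2019-01-07",    # window end (overwritten in the loops below)
    "age": [40,50],
    "gender": "",
    "device": "",
    "page": "1",
    "count": "20"               # top 20 keywords
}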
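
Because only the two dates change between requests, the form fields can be produced by a small helper instead of mutating a shared dictionary. The sketch below mirrors the fields of the `data` dictionary; `build_payload` and its parameters are hypothetical names, and the field semantics are assumed to be exactly what the original dictionary implies.

```python
from datetime import date


def build_payload(cid, start, end, ages=(40, 50), count=20):
    """Build the form fields for one request, mirroring the `data` dict."""
    if start > end:
        raise ValueError("start must not be after end")
    return {
        "cid": cid,
        "timeUnit": "date",
        "startDate": start.isoformat(),  # date.isoformat() gives YYYY-MM-DD
        "endDate": end.isoformat(),
        "age": list(ages),
        "gender": "",
        "device": "",
        "page": "1",
        "count": str(count),
    }


payload = build_payload("50000000", date(2019, 1, 1), date(2019, 1, 7))
print(payload["startDate"], payload["endDate"])
```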

In [201]:
# Empty frame that collects one row per day
df = pd.DataFrame(columns=["date", "keyWords"])

In [202]:
# Days 1/1/2019 - 1/6/2019 do not yet have 6 previous days,
# so each window starts on 1/1 and grows by one day.
startDate = datetime(2019, 1, 1)
rows = []
for i in range(6):
    endDate = startDate + timedelta(days=i)
    data["startDate"] = startDate.strftime('%Y-%m-%d')
    data["endDate"] = endDate.strftime('%Y-%m-%d')
    response = requests.post(url, headers=headers, data=data)
    response.raise_for_status()  # fail early on an HTTP error
    ranks = response.json()["ranks"]
    keyWords = [item["keyword"] for item in ranks]
    rows.append({'date': endDate, 'keyWords': keyWords})

# DataFrame.append was removed in pandas 2.0; concatenate the collected rows instead
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
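
The loop indexes `response["ranks"]` directly, so a response without that key stops the crawl with a `KeyError`. A minimal sketch of a more tolerant extraction step follows; `extract_keywords` is a hypothetical helper, and the response shape (a `ranks` list of objects with a `keyword` field) is taken from the loop above.

```python
def extract_keywords(response_json, limit=20):
    """Return up to `limit` keywords from a parsed response, in rank order.

    A missing or empty "ranks" entry yields an empty list instead of raising.
    """
    ranks = response_json.get("ranks") or []
    return [item["keyword"] for item in ranks[:limit] if "keyword" in item]


# Hand-written sample in the same shape as the parsed response
sample = {"ranks": [{"rank": 1, "keyword": "롱패딩"}, {"rank": 2, "keyword": "여성롱패딩"}]}
print(extract_keywords(sample))
print(extract_keywords({}))
```

Returning an empty list keeps the row for that day, which makes gaps visible in the final table rather than ending the run.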
      

In [203]:
df

| | date | keyWords |
|---|---|---|
| 0 | 2019-01-01 | [롱패딩, 여성롱패딩, 남자롱패딩, 키즈롱패딩, 여자롱패딩, 몽클레어여성패딩, 여자... |
| 1 | 2019-01-02 | [롱패딩, 여성롱패딩, 남자롱패딩, 키즈롱패딩, 여자롱패딩, 몽클레어여성패딩, 여성... |
| 2 | 2019-01-03 | [롱패딩, 여성롱패딩, 남자롱패딩, 키즈롱패딩, 여자롱패딩, 몽클레어여성패딩, 여성... |
| 3 | 2019-01-04 | [롱패딩, 여성롱패딩, 남자롱패딩, 키즈롱패딩, 몽클레어여성패딩, 여자롱패딩, 여성... |
| 4 | 2019-01-05 | [롱패딩, 여성롱패딩, 남자롱패딩, 몽클레어여성패딩, 키즈롱패딩, 여자롱패딩, 니트... |
| 5 | 2019-01-06 | [롱패딩, 여성롱패딩, 남자롱패딩, 몽클레어여성패딩, 키즈롱패딩, 니트원피스, 여자... |


In [204]:
startDate = datetime(2019,1,1)
endDate = datetime(2019,1,7)
keyWords = []

In [232]:
# Slide the 7-day window forward one day at a time through the end of 2019
rows = []
while endDate.year < 2020:
    data["startDate"] = startDate.strftime('%Y-%m-%d')
    data["endDate"] = endDate.strftime('%Y-%m-%d')
    response = requests.post(url, headers=headers, data=data)
    response.raise_for_status()  # fail early on an HTTP error
    ranks = response.json()["ranks"]
    keyWords = [item["keyword"] for item in ranks]
    rows.append({'date': endDate, 'keyWords': keyWords})

    startDate += timedelta(days=1)
    endDate += timedelta(days=1)

# DataFrame.append was removed in pandas 2.0; concatenate the collected rows instead
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
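
The window arithmetic in this loop can be checked without sending any requests. The sketch below generates the same sequence of (start, end) pairs as a plain generator; `sliding_windows` and its parameters are hypothetical names introduced for illustration.

```python
from datetime import date, timedelta


def sliding_windows(first_end, last_end, window_days=7):
    """Yield (start, end) pairs, one per day, each covering `window_days` days inclusive."""
    end = first_end
    while end <= last_end:
        yield end - timedelta(days=window_days - 1), end
        end += timedelta(days=1)


windows = list(sliding_windows(date(2019, 1, 7), date(2019, 12, 31)))
print(len(windows), windows[0], windows[-1])
```

Separating the date logic from the HTTP call also makes it easy to resume a crawl from an arbitrary day after an interruption.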
    

In [235]:
df

| | date | keyWords |
|---|---|---|
| 0 | 2019-01-01 | [롱패딩, 여성롱패딩, 남자롱패딩, 키즈롱패딩, 여자롱패딩, 몽클레어여성패딩, 여자... |
| 1 | 2019-01-02 | [롱패딩, 여성롱패딩, 남자롱패딩, 키즈롱패딩, 여자롱패딩, 몽클레어여성패딩, 여성... |
| 2 | 2019-01-03 | [롱패딩, 여성롱패딩, 남자롱패딩, 키즈롱패딩, 여자롱패딩, 몽클레어여성패딩, 여성... |
| 3 | 2019-01-04 | [롱패딩, 여성롱패딩, 남자롱패딩, 키즈롱패딩, 몽클레어여성패딩, 여자롱패딩, 여성... |
| 4 | 2019-01-05 | [롱패딩, 여성롱패딩, 남자롱패딩, 몽클레어여성패딩, 키즈롱패딩, 여자롱패딩, 니트... |
| ... | ... | ... |
| 360 | 2019-12-27 | [여성숏패딩, 핸드메이드코트, 원피스, 여성패딩, 롱패딩, 여성롱패딩, 니트원피스,... |
| 361 | 2019-12-28 | [여성숏패딩, 핸드메이드코트, 원피스, 여성패딩, 롱패딩, 여성롱패딩, 니트원피스,... |
| 362 | 2019-12-29 | [여성숏패딩, 핸드메이드코트, 원피스, 여성패딩, 롱패딩, 여성롱패딩, 니트원피스,... |
| 363 | 2019-12-30 | [여성숏패딩, 핸드메이드코트, 원피스, 여성패딩, 롱패딩, 여성롱패딩, 니트원피스,... |


In [237]:
df.to_csv(r'의류트렌드키워드.csv', index = False)
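
One caveat when saving a list-valued column with `to_csv`: each list is written as its string representation, so reading the file back yields strings rather than lists. The sketch below shows the round trip on a small in-memory stand-in for the saved file and restores the lists with `ast.literal_eval`; the variable names and the two-keyword sample row are illustrative.

```python
import ast
import io

import pandas as pd

# Small stand-in for the crawled frame, written to an in-memory buffer
frame = pd.DataFrame({"date": ["2019-01-01"], "keyWords": [["롱패딩", "여성롱패딩"]]})
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Plain read: the keyWords column comes back as a string
buffer.seek(0)
raw = pd.read_csv(buffer)

# Read with a converter: each cell is parsed back into a Python list
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    parse_dates=["date"],
    converters={"keyWords": ast.literal_eval},
)
print(type(raw.loc[0, "keyWords"]), type(restored.loc[0, "keyWords"]))
```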

In [239]:
df.iloc[0]["keyWords"]

['롱패딩',
 '여성롱패딩',
 '남자롱패딩',
 '키즈롱패딩',
 '여자롱패딩',
 '몽클레어여성패딩',
 '여자무스탕',
 '여성패딩',
 '니트원피스',
 '원피스',
 '핸드메이드코트',
 '밍크코트',
 '써스데이아일랜드',
 '아디다스롱패딩',
 '남자패딩',
 '아동롱패딩',
 '무스탕',
 '패딩',
 '여성경량패딩',
 '여성코트']
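
For the later comparison against product names, it may be more convenient to have one row per keyword with its rank than one list per day. A minimal sketch using `DataFrame.explode` follows; the two-row frame and the column names `keyword` and `rank` are assumptions for illustration.

```python
import pandas as pd

# Two-day stand-in for the crawled frame
wide = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-02"],
    "keyWords": [["롱패딩", "여성롱패딩"], ["롱패딩", "니트원피스"]],
})

# One row per (date, keyword); reset the index so every row has a unique label
long_df = wide.explode("keyWords").reset_index(drop=True)
long_df = long_df.rename(columns={"keyWords": "keyword"})

# The list order is the rank order, so count rows within each date
long_df["rank"] = long_df.groupby("date").cumcount() + 1
print(long_df)
```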