##### Crawl (雪球) XueQiu Investor Portfolio Rebalance History Data and Do Simple Analysis


* [雪球 (XueQiu)](https://xueqiu.com/) is one of the most popular stock information sharing forums in China.
* XueQiu hosts hundreds of thousands of portfolios (A-share stocks), and investors rebalance their portfolios from time to time.
* In this assignment, you are asked to crawl each portfolio's entire rebalance history (from the time the portfolio was created to its latest rebalance) and do some simple analysis.
* This is meaningful because each rebalance decision reflects the investor's personal experience and investment philosophy. The entire rebalance history of a successful portfolio (high return, low variance) can help us learn how to invest. Besides, we can also simply follow successful portfolios and invest accordingly, which also requires being able to track their rebalance histories.
* You may have to register an account and log in first to see the entire rebalance history.
* There are 3 tasks in this notebook, introduced below:

## Task 1: Crawl XueQiu User's Rebalance History

* Each portfolio has an ID. For example, the portfolio below has the ID `ZH010218`.

* On a XueQiu user's page, you may see the figures below, which record the user's rebalance history and the corresponding profit curve.

Time Series Profit              | Investment Portfolio  | 
:-------------------------:|:-------------------------:|
<img src="xueqiu1.jpeg" width = "300" height = "550"/>  |  <img src="xueqiu2.jpeg" width = "300" height = "550"/>  | 

* Crawl each portfolio's entire rebalance history.
* Return a dictionary in the following structure:

>```Python
{
    portfolio1_id:
    {
        time_1: {'cash_value':val, 'position':{stock1_symbol:{'volume':val, 'price':val}, stock2_symbol:...}}, 
        time_2: {'cash_value':val, 'position':{stock3_symbol:{'volume':val, 'price':val}, stock4_symbol:...}},
        ...
        time_n:
    }
    portfolio_id:
    {
        ...
    }
}
```

**Explanation**

* **portfolio_id**: introduced above.
* **time**: the time when each rebalance is done. 
* **cash_value**: the portfolio's total value. It is made up of two parts, `cash` plus the market value of the stocks held, as given by the following equation, where $n$ denotes the number of stocks the user holds at that rebalance.
\begin{equation}
cash\_value = cash + \sum\limits_{i=1}^n price_i * volume_i
\end{equation}
* **stock_symbol**: each stock is associated with a symbol; for example, in the figure above, "民生银行" (China Minsheng Bank) has the symbol "SH600016".
* **volume**: the number of shares of each stock in the investor's portfolio when the rebalance is performed.
* **price**: the stock's price when the investor performs the rebalance.
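
As a worked illustration of the formula above, the following minimal sketch computes `cash_value` for one rebalance record. The function name `compute_cash_value` and the sample numbers are hypothetical and chosen only for illustration; the `position` layout follows the dictionary structure shown earlier.

```python
def compute_cash_value(cash, position):
    """Return cash plus the market value of every held stock.

    `position` maps stock symbol -> {'volume': ..., 'price': ...},
    matching the structure described above.
    """
    holdings_value = sum(s['price'] * s['volume'] for s in position.values())
    return cash + holdings_value


# Hypothetical sample data, not taken from XueQiu.
sample_position = {
    'SH600016': {'volume': 100, 'price': 8.0},
    'SZ000888': {'volume': 50, 'price': 10.0},
}
total = compute_cash_value(1000.0, sample_position)
print(total)  # 1000 + 100*8 + 50*10 = 2300.0
```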

## Task 2: 

* Return the ID of the portfolio with the highest profit among the portfolios between "ZH010000" and "ZH020000".

## Task 3:

* Show the ID of the portfolio with the highest return, subject to the following two constraints: (1) the latest rebalance happened after May 1st, 2018, and (2) the rebalance history spans more than 2 years.
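
The two constraints can be checked directly against the Task 1 dictionary. The sketch below assumes time keys formatted as `%Y-%m-%d %H:%M:%S` (the format the crawler in this notebook produces), interprets "more than 2 years" as more than 730 days between the first and last rebalance, and takes a separate `returns` mapping from portfolio ID to return. The function names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
# Assumption: any time later than 2018-05-01 00:00:00 counts as "after May 1st".
CUTOFF = datetime(2018, 5, 1)
# Assumption: "2 years" is taken as 730 days.
MIN_SPAN = timedelta(days=730)


def is_eligible(history):
    """Check both Task 3 constraints for one portfolio's rebalance history."""
    if not history:
        return False
    times = [datetime.strptime(t, TIME_FORMAT) for t in history]
    first, last = min(times), max(times)
    return last > CUTOFF and (last - first) > MIN_SPAN


def best_eligible_portfolio(portfolios, returns):
    """Return the eligible portfolio ID with the highest return, or None."""
    eligible = [pid for pid, hist in portfolios.items()
                if pid in returns and is_eligible(hist)]
    return max(eligible, key=returns.get, default=None)


# Hypothetical sample data for illustration only.
portfolios = {
    'A': {'2015-01-05 09:30:00': {}, '2018-05-30 09:30:00': {}},  # eligible
    'B': {'2017-12-01 09:30:00': {}, '2018-06-01 09:30:00': {}},  # history too short
    'C': {'2014-01-02 09:30:00': {}, '2018-03-01 09:30:00': {}},  # last rebalance too old
}
returns = {'A': 12.5, 'B': 80.0, 'C': 40.0}
print(best_eligible_portfolio(portfolios, returns))  # A
```

Filtering first and ranking afterwards keeps the two concerns separate, so either constraint can be tightened without touching the ranking step.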

## Return

* For Task 1, extract all portfolios in the range **ZH010000** (included) to **ZH020000** (excluded) and organize the data in the dictionary format above. Save it as a pickle file named **portfolio.pkl** in the current directory.
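
Saving and reloading the dictionary with `pickle` can be sketched as follows. The helper names and the tiny sample dictionary are hypothetical; the sketch writes to a temporary directory so that it does not overwrite a real `portfolio.pkl`.

```python
import os
import pickle
import tempfile


def save_portfolios(data, path):
    """Serialize the portfolio dictionary to `path`."""
    with open(path, 'wb') as f:
        pickle.dump(data, f)


def load_portfolios(path):
    """Load a portfolio dictionary previously written by save_portfolios."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Hypothetical sample in the structure described above.
sample = {
    'ZH010000': {
        '2018-05-30 09:30:40': {
            'cash_value': 0.33,
            'position': {'SH600173': {'volume': 0.02, 'price': 4.93}},
        }
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'portfolio.pkl')
    save_portfolios(sample, pkl_path)
    restored = load_portfolios(pkl_path)

print(restored == sample)  # True
```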

## Put your source code below

### Task 1: Save the pickle file in the current directory.

In [26]:
import requests
import time
import pickle
# import re
from bs4 import BeautifulSoup
# import json
import numpy as np
import pandas as pd

In [2]:
# The full range is ZH010000 (included) to ZH020000 (excluded), i.e. range(10000, 20000);
# only the first 50 IDs are crawled here.
portfolio_id_list = ['ZH0' + str(num) for num in range(10000, 10050)]

In [3]:
def get_source_json(portfolio_id):
    """Fetch one page of rebalance history; return (decoded JSON, portfolio_id)."""
    request_json = 'https://xueqiu.com/cubes/rebalancing/history.json'
    # Access-token cookie; replace with the value from your own session.
    cookies = {'xqat': 'a52fa8edc1186cd5fb962b24faa8f04b2242c652'}
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/66.0.3359.181 Safari/537.36',
               }
    # Note: only the first page (at most 20 records) is requested.
    params = {
        'cube_symbol': portfolio_id,
        'count': '20',
        'page': '1',
    }
    response = requests.get(request_json, headers=headers, params=params, cookies=cookies)
    return response.json(), portfolio_id


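def timestamp_to_date(time_stamp, format_string="%Y-%m-%d %H:%M:%S"):
    """Convert a Unix timestamp (seconds or milliseconds) to a local-time string."""
    # Scale to a 10-digit value in seconds, e.g. a 13-digit millisecond value is divided by 1000.
    time_stamp = int(round(time_stamp * (10 ** (10 - len(str(time_stamp))))))
    return time.strftime(format_string, time.localtime(time_stamp))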


def get_target_json(data_json, portfolio_id):
    """Reshape the raw JSON into {portfolio_id: {time: {'cash_value': ..., 'position': ...}}}."""
    rebalance_history = {portfolio_id: {}}
    for history in data_json['list']:
        update_time = timestamp_to_date(history['updated_at'])
        # Map each stock symbol in this rebalance to its volume and price.
        position = {}
        for stock in history['rebalancing_histories']:
            position[stock['stock_symbol']] = {'volume': stock['volume'],
                                               'price': stock['price']}
        rebalance_history[portfolio_id][update_time] = {'cash_value': history['cash_value'],
                                                        'position': position}
    return rebalance_history




res = dict()
for portfolio_id in portfolio_id_list:
    print('processing', portfolio_id, "...")
    source_json = get_source_json(portfolio_id)
    try:
        target_json = get_target_json(source_json[0], source_json[1])
        res.update(target_json)
    except KeyError:
        # Skip responses that lack the expected fields.
        pass


with open('portfolio.pkl', 'wb') as f:
    pickle.dump(res, f)
    
print('Finished')

processing ZH010000 ...
processing ZH010001 ...
processing ZH010002 ...
processing ZH010003 ...
processing ZH010004 ...
processing ZH010005 ...
processing ZH010006 ...
processing ZH010007 ...
processing ZH010008 ...
processing ZH010009 ...
processing ZH010010 ...
processing ZH010011 ...
processing ZH010012 ...
processing ZH010013 ...
processing ZH010014 ...
processing ZH010015 ...
processing ZH010016 ...
processing ZH010017 ...
processing ZH010018 ...
processing ZH010019 ...
processing ZH010020 ...
processing ZH010021 ...
processing ZH010022 ...
processing ZH010023 ...
processing ZH010024 ...
processing ZH010025 ...
processing ZH010026 ...
processing ZH010027 ...
processing ZH010028 ...
processing ZH010029 ...
processing ZH010030 ...
processing ZH010031 ...
processing ZH010032 ...
processing ZH010033 ...
processing ZH010034 ...
processing ZH010035 ...
processing ZH010036 ...
processing ZH010037 ...
processing ZH010038 ...
processing ZH010039 ...
processing ZH010040 ...
processing ZH010

In [65]:
res

{'ZH010000': {'2018-05-30 09:30:40': {'cash_value': 0.33101316,
   'position': {'SH600173': {'volume': 0.02252427, 'price': 4.93}}},
  '2018-04-25 09:30:38': {'cash_value': 0.33101316,
   'position': {'SZ000888': {'volume': 0.01467176, 'price': 9.11}}},
  '2017-07-17 09:30:31': {'cash_value': 0.33101316,
   'position': {'SH600775': {'volume': 0.01308936, 'price': 10.71}}},
  '2017-07-07 09:30:20': {'cash_value': 0.33101316,
   'position': {'SH600599': {'volume': 0.0038169, 'price': 24.0}}},
  '2017-06-26 09:30:38': {'cash_value': 0.33101316,
   'position': {'SH600580': {'volume': 0.01442289, 'price': 6.82}}},
  '2017-06-02 09:30:34': {'cash_value': 0.33101316,
   'position': {'SZ000888': {'volume': 0.01451246, 'price': 11.25}}},
  '2017-04-18 09:30:16': {'cash_value': 0.33101316,
   'position': {'SH600173': {'volume': 0.02207647, 'price': 11.43}}},
  '2016-07-15 09:30:13': {'cash_value': 0.33101316,
   'position': {'SH600775': {'volume': 0.01300436, 'price': 16.96}}},
  '2016-06-22 09:
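
The request above asks only for `page=1` with `count=20`, so each portfolio yields at most one page of records. One way to collect the entire history is to keep requesting pages until one comes back empty. The sketch below assumes that a page past the end returns an empty `list` field, which has not been verified against the XueQiu endpoint. `fetch_page` is a hypothetical callable standing in for the HTTP request, and the fake fetcher exists only to make the example runnable offline.

```python
def crawl_all_pages(fetch_page, portfolio_id, max_pages=1000):
    """Collect records from successive pages until an empty page is seen.

    fetch_page(portfolio_id, page) must return the decoded JSON of one page
    as a dict; records are read from its 'list' key.
    """
    records = []
    for page in range(1, max_pages + 1):
        data = fetch_page(portfolio_id, page)
        batch = data.get('list', [])
        if not batch:  # assumption: an empty page marks the end of the history
            break
        records.extend(batch)
    return records


def fake_fetch_page(portfolio_id, page):
    """Offline stand-in: three pages of two records each, then empty pages."""
    if page > 3:
        return {'list': []}
    return {'list': [{'id': (page - 1) * 2 + i} for i in range(2)]}


history = crawl_all_pages(fake_fetch_page, 'ZH010000')
print(len(history))  # 6
```

The `max_pages` cap guards against an endless loop if the end-of-history assumption turns out to be wrong. In a real crawl, a short pause between requests (for example with `time.sleep`) would also reduce the load on the server.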

### Task 2: 

In [45]:
# Scrape a portfolio's return rate from its page
def get_return_rate(portfolio_id):
    """Return {portfolio_id: rate}, or None if the page has no return-rate element."""
    cookies = {
        'xqat': 'a52fa8edc1186cd5fb962b24faa8f04b2242c652',
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    }

    response = requests.get('https://xueqiu.com/P/' + portfolio_id, headers=headers, cookies=cookies)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Use the first element carrying the return-rate class, if any.
    tag = soup.select_one('.cube-profit-year')
    if tag is None:
        return None
    return {portfolio_id: float(tag.span.text)}

return_list = dict()
for portfolio_id in portfolio_id_list:
    print('processing', portfolio_id, "...")
    rate = get_return_rate(portfolio_id)
    if rate is not None:  # skip pages without a return rate
        return_list.update(rate)
print('finished')

processing ZH010000 ...
processing ZH010001 ...
processing ZH010002 ...
processing ZH010003 ...
processing ZH010004 ...
processing ZH010005 ...
processing ZH010006 ...
processing ZH010007 ...
processing ZH010008 ...
processing ZH010009 ...
processing ZH010010 ...
processing ZH010011 ...
processing ZH010012 ...
processing ZH010013 ...
processing ZH010014 ...
processing ZH010015 ...
processing ZH010016 ...
processing ZH010017 ...
processing ZH010018 ...
processing ZH010019 ...
processing ZH010020 ...
processing ZH010021 ...
processing ZH010022 ...
processing ZH010023 ...
processing ZH010024 ...
processing ZH010025 ...
processing ZH010026 ...
processing ZH010027 ...
processing ZH010028 ...
processing ZH010029 ...
processing ZH010030 ...
processing ZH010031 ...
processing ZH010032 ...
processing ZH010033 ...
processing ZH010034 ...
processing ZH010035 ...
processing ZH010036 ...
processing ZH010037 ...
processing ZH010038 ...
processing ZH010039 ...
processing ZH010040 ...
processing ZH010

In [46]:
return_list

{'ZH010000': -16.34,
 'ZH010001': 3.0,
 'ZH010002': -7.97,
 'ZH010003': 0.0,
 'ZH010004': 105.49,
 'ZH010005': -29.55,
 'ZH010006': 52.66,
 'ZH010007': -30.1,
 'ZH010008': 4.22,
 'ZH010009': 149.72,
 'ZH010010': -11.99,
 'ZH010011': 30.31,
 'ZH010013': 44.18,
 'ZH010014': 123.72,
 'ZH010015': -17.37,
 'ZH010016': 258.28,
 'ZH010017': 41.01,
 'ZH010018': 84.83,
 'ZH010019': -37.96,
 'ZH010020': 5.85,
 'ZH010021': 104.29,
 'ZH010022': -61.63,
 'ZH010023': 36.13,
 'ZH010024': 134.86,
 'ZH010025': 121.0,
 'ZH010026': 54.24,
 'ZH010027': 15.62,
 'ZH010028': 1.53,
 'ZH010029': -19.81,
 'ZH010030': 66.13,
 'ZH010031': -24.74,
 'ZH010032': -28.32,
 'ZH010033': -49.89,
 'ZH010034': 41.05,
 'ZH010035': 25.25,
 'ZH010036': -60.36,
 'ZH010037': 2.03,
 'ZH010038': 57.78,
 'ZH010039': -10.8,
 'ZH010040': 4.04}

In [64]:
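return_series = pd.Series(return_list)
# idxmax returns the index label (the portfolio ID) of the largest value;
# np.argmax on a Series returns an integer position in current pandas versions.
return_series.idxmax()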

'ZH010016'

### Task 3: