Have you ever had to answer this question when you came home from work? I have, and more than once. From Netflix to Hulu, robust movie recommendation systems matter because modern consumers expect personalized content.

Here is a simple example of a recommendation system at work:

User A watches Game of Thrones and Breaking Bad.
User B searches for Game of Thrones, and the system suggests Breaking Bad based on the data collected about User A.
Recommendation systems are used not only for movies but also for many other products and services, such as Amazon (books, items), Pandora/Spotify (music), Google (news, search), and YouTube (videos).
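
The User A / User B example can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the method used later in the notebook; the `watch_history` data, the extra user, the "Westworld" title, and the `suggest` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical watch histories; user_a mirrors the example above.
watch_history = {
    "user_a": {"Game of Thrones", "Breaking Bad"},
    "user_c": {"Game of Thrones", "Breaking Bad", "Westworld"},
}

def suggest(title, histories, top_n=1):
    """Rank titles by how many users watched them together with `title`."""
    co_watched = Counter()
    for titles in histories.values():
        if title in titles:
            # Count every other title this user watched alongside `title`.
            co_watched.update(titles - {title})
    return [t for t, _ in co_watched.most_common(top_n)]

# User B searches for Game of Thrones.
print(suggest("Game of Thrones", watch_history))
```

Counting co-occurrences like this is the simplest form of using other users' behaviour to drive a suggestion.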

The two most common types of personalized recommendation systems are content-based and collaborative filtering. Collaborative filtering produces recommendations from users’ attitudes toward items; that is, it uses the “wisdom of the crowd” to recommend items. In contrast, content-based systems focus on the attributes of the items and recommend items that are similar to one another.
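
To make the difference concrete, here is a minimal sketch that computes an item-to-item similarity in both ways. The rating matrix, the genre flags, and the `cosine` helper are made-up assumptions for illustration; cosine similarity is just one common choice of measure.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Collaborative view: an item is described by the ratings users gave it
# (rows = users, columns = items). Values are hypothetical.
ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 2.0, 5.0],
])
collab_sim = cosine(ratings[:, 0], ratings[:, 1])

# Content-based view: an item is described by its own attributes
# (hypothetical genre flags: drama, crime, comedy).
attributes = np.array([
    [1.0, 0.0, 0.0],  # item 0
    [1.0, 1.0, 0.0],  # item 1
    [0.0, 0.0, 1.0],  # item 2
])
content_sim = cosine(attributes[0], attributes[1])

print(round(collab_sim, 3), round(content_sim, 3))
```

The same pair of items gets a score from user behaviour in the first case and from item metadata in the second, which is the whole distinction between the two families.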

In this notebook, I will attempt to implement these two systems to recommend movies and evaluate them to see which one performs better.

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [14]:
%%time
info_content_df = pd.read_csv('data/Info_Content.csv', index_col='ucid')
info_userdata_df = pd.read_csv('data/Info_UserData.csv', index_col='uuid')
log_problem_df = pd.read_csv('data/Log_Problem.csv')

Wall time: 36.1 s


In [15]:
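# Reduce the data size by keeping only records before 2019.
# Plain string comparison works here because timestamps start with YYYY-MM-DD.
log_problem_df = log_problem_df[log_problem_df['timestamp_TW'] < '2019-01-01']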
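
The filter above compares strings. A possible alternative is to parse the column once and compare real timestamps, sketched below on a toy frame. The `log`, `ts`, `cutoff`, and `log_2018` names are assumptions for illustration, and the format string assumes every value looks like the rows shown in this notebook.

```python
import pandas as pd

# Toy stand-in for log_problem_df; only the timestamp column matters here.
log = pd.DataFrame({
    "timestamp_TW": [
        "2018-09-28 20:00:00 UTC",
        "2018-12-31 23:45:00 UTC",
        "2019-01-01 00:00:00 UTC",
    ]
})

# Parse once into timezone-aware timestamps, then compare against a cutoff.
ts = pd.to_datetime(log["timestamp_TW"], format="%Y-%m-%d %H:%M:%S %Z", utc=True)
cutoff = pd.Timestamp("2019-01-01", tz="UTC")
log_2018 = log[ts < cutoff]

print(len(log_2018))
```

Parsing costs time on a multi-million-row log, but it fails loudly on malformed values instead of silently mis-sorting them.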

In [16]:
merge_log_problem_info_userdata_df = log_problem_df.merge(info_userdata_df, how='left', on = 'uuid')
merge_log_problem_info_userdata_info_content_df = merge_log_problem_info_userdata_df.merge(info_content_df, how='left', on = 'ucid')
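
Left merges like the two above silently keep log rows whose `uuid` or `ucid` has no match, and they multiply rows if a lookup table contains duplicate keys. The sketch below shows one way to guard against both with the `validate` and `indicator` options of `DataFrame.merge`. The toy frames and their values are made up; only the column names come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the three tables; the lookup tables are indexed by
# their key, as in the read_csv calls above.
log = pd.DataFrame({
    "uuid": ["u1", "u1", "u2"],
    "ucid": ["c1", "c2", "c9"],   # "c9" has no entry in the content table
    "is_correct": [True, False, True],
})
users = pd.DataFrame({"user_grade": [5, 7]},
                     index=pd.Index(["u1", "u2"], name="uuid"))
content = pd.DataFrame({"difficulty": ["easy", "hard"]},
                       index=pd.Index(["c1", "c2"], name="ucid"))

merged = (
    log.merge(users, how="left", on="uuid", validate="many_to_one")
       .merge(content, how="left", on="ucid", validate="many_to_one",
              indicator=True)
)

# Rows flagged "left_only" found no matching content record.
unmatched = (merged["_merge"] == "left_only").sum()
print(len(merged), unmatched)
```

`validate="many_to_one"` raises an error if a key is duplicated in the lookup table, so the row count of the log is guaranteed to be preserved.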

In [17]:
merge_log_problem_info_userdata_info_content_df.head()

,timestamp_TW,uuid,ucid,upid,problem_number,exercise_problem_repeat_session,is_correct,total_sec_taken,total_attempt_cnt,used_hint_cnt,...,has_class_cnt,content_pretty_name,content_kind,difficulty,subject,learning_stage,level1_id,level2_id,level3_id,level4_id
0,2018-09-28 20:00:00 UTC,Kpq2q+eKw/O+6/jLs3XJosgmI7weEJxJZdnkKTbbF8I=,Ps4dfShfpeMF3VG030HqZ2bsbD7PaVxvJYFTtroeSzQ=,ZmKEZ0F2WFqhlL7KFfJcHEnZCZu0e4p+CVG5rSlyKYk=,2,1,True,11,1,0,...,1,【基礎】倍數的應用,Exercise,easy,math,elementary,aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=,7f73q332BKPBXaixasa4EkUb+pF6VAsLxNIg4506JJs=,ItasYR+er/FlZlRvL66/NB3wY0AvmlrZKoqe4gmPyD0=,VHYt8IeoqiIPVsB+32JAhIjK9jU+pnr11fL80QshARI=
1,2018-09-28 10:15:00 UTC,0+VU/Zb0Q96uoByuRhl7r9bJuJO6CKWpsmNMEuijSzc=,/d39FzqaM3PZzpoMXxA80PMICsVhzfL6MGSCqZtsQOo=,tO9dyvadKWMVQgEx/BXtRIYJ2TRJFQgwvcsBwFb4+xI=,6,1,True,26,1,0,...,0,【基礎】找出最小公倍數,Exercise,easy,math,elementary,aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=,7f73q332BKPBXaixasa4EkUb+pF6VAsLxNIg4506JJs=,ItasYR+er/FlZlRvL66/NB3wY0AvmlrZKoqe4gmPyD0=,VHYt8IeoqiIPVsB+32JAhIjK9jU+pnr11fL80QshARI=
2,2018-09-05 20:00:00 UTC,g8DnYvIqpolw10XlwWeIWv6NbDPByUbmgH8EshJqBns=,YuGOmB+frbM8rfAa0RJE882R+IoMf9N89OiVqLbAHBw=,6Lxz6aXvgyw3vZd3v8g6jgoCRDPOQzVPx/dnEC0o7DQ=,4,1,True,78,1,0,...,0,【基礎】尋找質因數,Exercise,easy,math,elementary,aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=,7f73q332BKPBXaixasa4EkUb+pF6VAsLxNIg4506JJs=,ItasYR+er/FlZlRvL66/NB3wY0AvmlrZKoqe4gmPyD0=,DoAefIneFglvkxJ4Jb8VyB8JrESm9UEHtGyV4MqiwCo=
3,2018-09-14 16:30:00 UTC,kSyUTFlepsYUD723IPL/jEZ520xaKbscrBmNtBUFR1o=,BG1RsWojzEHzV28RBm/1iKi1NyZgDcDomLYEJSV6lmo=,1fIjdakTApQp5PfWog87uOmM6JuoNE/oQq2y5/fMmfw=,3,1,True,7,1,0,...,0,【基礎】數的相關名詞介紹,Exercise,easy,math,junior,aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=,xYDz4OEv0xsri1IpmXlrgMLJ848rgySf+39xWpq4DBI=,/yqeM1FRP1rB9WuQWBkStMqrBQgjEexaeyWIhBC7ov4=,Vuo8t3kw/4IH80FuZ0l0uJPwpfrMs8SxhCbJA8zn3vU=
4,2018-09-13 16:00:00 UTC,XMFbFA7C49+LRhUddhelfPpA6F5dbOoxeyL3eYbuTlY=,qPHR8aBqOhKij9IS/Y8IR8prwWruoDBGU1tVUhXDJkE=,8V/NT6M+er2I3V3ZIWRNo4Qbo3Iad89PHbeeZeoZeF0=,12,1,True,48,1,0,...,0,【基礎】大數的加減,Exercise,easy,math,elementary,aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=,7f73q332BKPBXaixasa4EkUb+pF6VAsLxNIg4506JJs=,scsWmkZsfmdmD2IzB24sQ1Au1BOXYgQEx9zO3+4glq8=,hq6uCe9NmtCc+0wlbGGIsxegP2cqYAdFebGd+v4/o8Q=


In [46]:
merge_log_problem_info_userdata_info_content_df.shape

(7197409, 34)

In [28]:
merge_log_problem_info_userdata_info_content_df.columns

Index(['timestamp_TW', 'uuid', 'ucid', 'upid', 'problem_number',
       'exercise_problem_repeat_session', 'is_correct', 'total_sec_taken',
       'total_attempt_cnt', 'used_hint_cnt', 'is_hint_used', 'is_downgrade',
       'is_upgrade', 'level', 'gender', 'points', 'badges_cnt',
       'first_login_date_TW', 'user_grade', 'user_city', 'has_teacher_cnt',
       'is_self_coach', 'has_student_cnt', 'belongs_to_class_cnt',
       'has_class_cnt', 'content_pretty_name', 'content_kind', 'difficulty',
       'subject', 'learning_stage', 'level1_id', 'level2_id', 'level3_id',
       'level4_id'],
      dtype='object')

In [35]:
x = merge_log_problem_info_userdata_info_content_df.groupby(['uuid','ucid'])['is_correct'].sum()

In [37]:
x.to_frame()

,,is_correct
uuid,ucid,
++5bdNp/LZvGenJ8Brp4n2SfS9d4pu4qA7cF7FQW7hk=,VY6aXT7f64ny+uy4pszHVNSy3WHyoFPuhwToxBhB3wM=,5.0
++E4TrlDYvGtPBg1edhkLXLEEbnfiAgAamPQ33vpW8M=,Qx6mwirYKln7CTvOXad5Do5OkVKmypYSQfFs0MB6Cvs=,7.0
++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,412PAnenNdYglQWXSlVtS1RYA7Yg60Wty166LCMaiHU=,8.0
++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,HKqPgUALqZGw984KFqAMeDwDJTi9cNhJ51UXfjzELSg=,6.0
++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,VFSfDdE2vLsyVfGCGYLnxYih+n8+IpUbRoiUpDJ4hc4=,5.0
...,...,...
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,EeAQRS+kGarrxqFWxO5U2lJ12zE7xgtIiHO9ojBNYpw=,5.0
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,HiffIWPAC1YJDwu8WRYatMOmxW/ufbs1/6A6HoWj1dU=,5.0
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,m6DnlS38gmb+QFf7102iqmQim9m7dKlTfZ7bPLN+Ato=,3.0
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,oAoQyh9NT0lXlshN6sYYNSMTizpgCVqITEKcEthrDZU=,12.0
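
The sum above counts correct answers per (user, exercise) pair but says nothing about how many attempts were made. A possible variant, sketched on toy data, collects the count of correct answers, the number of attempts, and the accuracy in one pass with named aggregation. The `attempts` frame and the output column names are assumptions for illustration.

```python
import pandas as pd

# Toy attempt log with the same three columns used above.
attempts = pd.DataFrame({
    "uuid": ["u1", "u1", "u1", "u2", "u2"],
    "ucid": ["c1", "c1", "c2", "c1", "c1"],
    "is_correct": [True, False, True, True, True],
})

# One row per (user, exercise) pair with three summary columns.
per_pair = attempts.groupby(["uuid", "ucid"])["is_correct"].agg(
    correct="sum",     # number of correct answers
    attempts="count",  # number of non-missing attempts
    accuracy="mean",   # share of correct answers
)

print(per_pair)
```

Having `accuracy` next to `attempts` makes it possible to tell a 100% score on one attempt apart from a 100% score on twenty.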


In [41]:
pd.crosstab([merge_log_problem_info_userdata_info_content_df.uuid,merge_log_problem_info_userdata_info_content_df.ucid], merge_log_problem_info_userdata_info_content_df.is_correct).reset_index(level=1,drop=True)

is_correct,False,True
uuid,,
++5bdNp/LZvGenJ8Brp4n2SfS9d4pu4qA7cF7FQW7hk=,0,5
++E4TrlDYvGtPBg1edhkLXLEEbnfiAgAamPQ33vpW8M=,3,7
++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,0,8
++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,3,6
++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,0,5
...,...,...
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,0,5
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,1,5
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,0,3
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,2,12


In [115]:
x1 = pd.crosstab([merge_log_problem_info_userdata_info_content_df.uuid,merge_log_problem_info_userdata_info_content_df.ucid], merge_log_problem_info_userdata_info_content_df.is_correct)
x1

,is_correct,False,True
uuid,ucid,,
++5bdNp/LZvGenJ8Brp4n2SfS9d4pu4qA7cF7FQW7hk=,VY6aXT7f64ny+uy4pszHVNSy3WHyoFPuhwToxBhB3wM=,0,5
++E4TrlDYvGtPBg1edhkLXLEEbnfiAgAamPQ33vpW8M=,Qx6mwirYKln7CTvOXad5Do5OkVKmypYSQfFs0MB6Cvs=,3,7
++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,412PAnenNdYglQWXSlVtS1RYA7Yg60Wty166LCMaiHU=,0,8
++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,HKqPgUALqZGw984KFqAMeDwDJTi9cNhJ51UXfjzELSg=,3,6
++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,VFSfDdE2vLsyVfGCGYLnxYih+n8+IpUbRoiUpDJ4hc4=,0,5
...,...,...,...
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,EeAQRS+kGarrxqFWxO5U2lJ12zE7xgtIiHO9ojBNYpw=,0,5
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,HiffIWPAC1YJDwu8WRYatMOmxW/ufbs1/6A6HoWj1dU=,1,5
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,m6DnlS38gmb+QFf7102iqmQim9m7dKlTfZ7bPLN+Ato=,0,3
zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,oAoQyh9NT0lXlshN6sYYNSMTizpgCVqITEKcEthrDZU=,2,12


In [116]:
x1.index.name

In [111]:
x1.columns

Index([False, True], dtype='object', name='is_correct')

In [124]:
x = pd.crosstab(merge_log_problem_info_userdata_info_content_df.uuid,merge_log_problem_info_userdata_info_content_df.ucid, values=merge_log_problem_info_userdata_info_content_df.is_correct, aggfunc='mean').reset_index().fillna(0)
x

ucid,uuid,+DlgHAr1GtoQgtGqwoen6pt4/ayVRO+rMaCVZ7jGCHE=,+IgBffWedJpxG6Zo/kHbrgIRR4jwwTwa6nV03GLwX5A=,+Lgj0Me9/8gtiruGKq8KxemAD15kU4yCfb6nteNDWjw=,+N+e7SzcUVowUo7D4udR8rBKRmR1H7yuu5Tleqlhv48=,+ZVl8HEbTM1GOoCTt2wxAfzQswvAWL3L8e5mLuyy1dY=,+aXi8dpg0URdKkJtkicic7DrTHAWPSnR9bwD+QouANE=,+c+5BQh1a2P75VjWgg/aX+j7kru6tteMWrupq+MMQgw=,+cLbiKkV7+VhNPr7OGG3B3kPpZ8er0mvFvqPbM/gm+E=,+fQqnCkVTMs8GPkdZJWUcApXC/Ea9bVtxNIblqLRie4=,...,z9WEaz3dU4bh7FPuKdAt6UrOgqpVnwWFrb/AL4dkKJs=,zFUHHn64b2GorjfAU6+BVy85oSZxzLhsrqGzKB7JwZc=,zHDjrBKb5lNSncPDp2hPXdb1zXdHj1Y/iQvzW7DTaIQ=,zHqZC5BMQzmA8ZbWd776nwinT6LWg2mxiT0WiW/5YZE=,zX5HAUpRYC4F7YPNs5tUeBqVi++rEmPUfsAtHpoZmEQ=,zcH7Yl/gFwkU2Mej9UzQgaJL66wIdiMucPZwVyeqHxc=,zh/Ht6E99DfEW1ZtC/mgD6MmKmVQEiH09UOhX+Dz3rI=,znVNbyVOWXo/XrF4bWN0DAWkAGmYu/jwUwO5BReUQxY=,zpLoCKHOugDScAlFt0XHJJKFwU+r3YZaBlTf5G7qWpE=,zxxjH0rOlRk3Fe7P3H8+00eXPXeazjvuIXf2bdEyhJo=
0,++5bdNp/LZvGenJ8Brp4n2SfS9d4pu4qA7cF7FQW7hk=,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,++E4TrlDYvGtPBg1edhkLXLEEbnfiAgAamPQ33vpW8M=,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,++KRUQaY4gFsmV7egJVOWqbqbTth/oJBB7yX1HvCSL4=,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,++LzeMQy/8bNUGD2K5Ms/GdiBlQ16ii82xNw3BYtOZE=,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41021,zzi+wYqYExc64gAT8ir+SILPXPOU7MZBYthQfazp620=,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41022,zztwLsjf44Uzumou84qh04wObpugkIllLCIp/mgoMu8=,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41023,zzujjvdKzb4wL8cFKdF3Of5gg2XHzIDQCZ2TfaYpSNs=,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41024,zzvvUOwcSXXTXd8xoyimNcm2DjeVPLmZAI19WKfhSAM=,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
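
The user-by-exercise table above is almost entirely zeros, and `fillna(0)` makes "never attempted" indistinguishable from "0% correct". One possible alternative, sketched here on toy data, is to build a SciPy sparse matrix straight from the long-format pairs so that only attempted pairs are stored. The `pairs` frame and the variable names are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy per-pair accuracy table in long format.
pairs = pd.DataFrame({
    "uuid": ["u1", "u1", "u2", "u3"],
    "ucid": ["c1", "c2", "c1", "c3"],
    "accuracy": [0.5, 1.0, 1.0, 0.25],
})

# Map each identifier to an integer row/column position.
user_codes, user_ids = pd.factorize(pairs["uuid"])
item_codes, item_ids = pd.factorize(pairs["ucid"])

# Only the attempted (user, exercise) pairs are stored.
user_item = csr_matrix(
    (pairs["accuracy"].to_numpy(), (user_codes, item_codes)),
    shape=(len(user_ids), len(item_ids)),
)

print(user_item.shape, user_item.nnz)
```

`user_ids` and `item_ids` keep the mapping back from matrix positions to the original identifiers.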


In [125]:
x.describe()

ucid,+DlgHAr1GtoQgtGqwoen6pt4/ayVRO+rMaCVZ7jGCHE=,+IgBffWedJpxG6Zo/kHbrgIRR4jwwTwa6nV03GLwX5A=,+Lgj0Me9/8gtiruGKq8KxemAD15kU4yCfb6nteNDWjw=,+N+e7SzcUVowUo7D4udR8rBKRmR1H7yuu5Tleqlhv48=,+ZVl8HEbTM1GOoCTt2wxAfzQswvAWL3L8e5mLuyy1dY=,+aXi8dpg0URdKkJtkicic7DrTHAWPSnR9bwD+QouANE=,+c+5BQh1a2P75VjWgg/aX+j7kru6tteMWrupq+MMQgw=,+cLbiKkV7+VhNPr7OGG3B3kPpZ8er0mvFvqPbM/gm+E=,+fQqnCkVTMs8GPkdZJWUcApXC/Ea9bVtxNIblqLRie4=,+g5YlvjiVYzjFfpMkHEGCty7PGcKRuuCVDY/g8dzoD0=,...,z9WEaz3dU4bh7FPuKdAt6UrOgqpVnwWFrb/AL4dkKJs=,zFUHHn64b2GorjfAU6+BVy85oSZxzLhsrqGzKB7JwZc=,zHDjrBKb5lNSncPDp2hPXdb1zXdHj1Y/iQvzW7DTaIQ=,zHqZC5BMQzmA8ZbWd776nwinT6LWg2mxiT0WiW/5YZE=,zX5HAUpRYC4F7YPNs5tUeBqVi++rEmPUfsAtHpoZmEQ=,zcH7Yl/gFwkU2Mej9UzQgaJL66wIdiMucPZwVyeqHxc=,zh/Ht6E99DfEW1ZtC/mgD6MmKmVQEiH09UOhX+Dz3rI=,znVNbyVOWXo/XrF4bWN0DAWkAGmYu/jwUwO5BReUQxY=,zpLoCKHOugDScAlFt0XHJJKFwU+r3YZaBlTf5G7qWpE=,zxxjH0rOlRk3Fe7P3H8+00eXPXeazjvuIXf2bdEyhJo=
count,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,...,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0
mean,0.000803,0.000521,0.002397,0.019039,0.005714,0.019643,0.029784,0.003422,0.005226,0.001005,...,0.015914,0.018414,0.001906,0.00472,0.035796,0.02057,0.001127,0.001298,0.035651,0.017713
std,0.02437,0.018106,0.044491,0.121991,0.064726,0.131742,0.160199,0.041637,0.064896,0.028242,...,0.115583,0.125022,0.035734,0.061259,0.171184,0.124708,0.028594,0.032297,0.166971,0.121842
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [118]:
x.describe()

ucid,+DlgHAr1GtoQgtGqwoen6pt4/ayVRO+rMaCVZ7jGCHE=,+IgBffWedJpxG6Zo/kHbrgIRR4jwwTwa6nV03GLwX5A=,+Lgj0Me9/8gtiruGKq8KxemAD15kU4yCfb6nteNDWjw=,+N+e7SzcUVowUo7D4udR8rBKRmR1H7yuu5Tleqlhv48=,+ZVl8HEbTM1GOoCTt2wxAfzQswvAWL3L8e5mLuyy1dY=,+aXi8dpg0URdKkJtkicic7DrTHAWPSnR9bwD+QouANE=,+c+5BQh1a2P75VjWgg/aX+j7kru6tteMWrupq+MMQgw=,+cLbiKkV7+VhNPr7OGG3B3kPpZ8er0mvFvqPbM/gm+E=,+fQqnCkVTMs8GPkdZJWUcApXC/Ea9bVtxNIblqLRie4=,+g5YlvjiVYzjFfpMkHEGCty7PGcKRuuCVDY/g8dzoD0=,...,z9WEaz3dU4bh7FPuKdAt6UrOgqpVnwWFrb/AL4dkKJs=,zFUHHn64b2GorjfAU6+BVy85oSZxzLhsrqGzKB7JwZc=,zHDjrBKb5lNSncPDp2hPXdb1zXdHj1Y/iQvzW7DTaIQ=,zHqZC5BMQzmA8ZbWd776nwinT6LWg2mxiT0WiW/5YZE=,zX5HAUpRYC4F7YPNs5tUeBqVi++rEmPUfsAtHpoZmEQ=,zcH7Yl/gFwkU2Mej9UzQgaJL66wIdiMucPZwVyeqHxc=,zh/Ht6E99DfEW1ZtC/mgD6MmKmVQEiH09UOhX+Dz3rI=,znVNbyVOWXo/XrF4bWN0DAWkAGmYu/jwUwO5BReUQxY=,zpLoCKHOugDScAlFt0XHJJKFwU+r3YZaBlTf5G7qWpE=,zxxjH0rOlRk3Fe7P3H8+00eXPXeazjvuIXf2bdEyhJo=
count,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,...,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0,41026.0
mean,0.006313,0.006411,0.013162,0.169185,0.05438,0.143811,0.200751,0.067859,0.05031,0.010237,...,0.129796,0.127895,0.022035,0.039317,0.367036,0.233072,0.010871,0.013236,0.348876,0.126237
std,0.203645,0.270207,0.281845,1.369296,0.778218,1.229532,1.291104,1.98303,0.840129,0.406159,...,1.174997,1.006111,0.525848,0.633372,2.214831,1.713163,0.300037,0.352783,2.320098,1.000466
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,13.0,27.0,15.0,86.0,76.0,58.0,54.0,231.0,76.0,59.0,...,61.0,26.0,65.0,41.0,129.0,60.0,18.0,19.0,165.0,48.0


In [126]:
xy = x.values.T
xy.shape

(1316, 41026)

In [134]:
x.head()

ucid,uuid,+DlgHAr1GtoQgtGqwoen6pt4/ayVRO+rMaCVZ7jGCHE=,+IgBffWedJpxG6Zo/kHbrgIRR4jwwTwa6nV03GLwX5A=,+Lgj0Me9/8gtiruGKq8KxemAD15kU4yCfb6nteNDWjw=,+N+e7SzcUVowUo7D4udR8rBKRmR1H7yuu5Tleqlhv48=,+ZVl8HEbTM1GOoCTt2wxAfzQswvAWL3L8e5mLuyy1dY=,+aXi8dpg0URdKkJtkicic7DrTHAWPSnR9bwD+QouANE=,+c+5BQh1a2P75VjWgg/aX+j7kru6tteMWrupq+MMQgw=,+cLbiKkV7+VhNPr7OGG3B3kPpZ8er0mvFvqPbM/gm+E=,+fQqnCkVTMs8GPkdZJWUcApXC/Ea9bVtxNIblqLRie4=,...,z9WEaz3dU4bh7FPuKdAt6UrOgqpVnwWFrb/AL4dkKJs=,zFUHHn64b2GorjfAU6+BVy85oSZxzLhsrqGzKB7JwZc=,zHDjrBKb5lNSncPDp2hPXdb1zXdHj1Y/iQvzW7DTaIQ=,zHqZC5BMQzmA8ZbWd776nwinT6LWg2mxiT0WiW/5YZE=,zX5HAUpRYC4F7YPNs5tUeBqVi++rEmPUfsAtHpoZmEQ=,zcH7Yl/gFwkU2Mej9UzQgaJL66wIdiMucPZwVyeqHxc=,zh/Ht6E99DfEW1ZtC/mgD6MmKmVQEiH09UOhX+Dz3rI=,znVNbyVOWXo/XrF4bWN0DAWkAGmYu/jwUwO5BReUQxY=,zpLoCKHOugDScAlFt0XHJJKFwU+r3YZaBlTf5G7qWpE=,zxxjH0rOlRk3Fe7P3H8+00eXPXeazjvuIXf2bdEyhJo=
0,++5bdNp/LZvGenJ8Brp4n2SfS9d4pu4qA7cF7FQW7hk=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,++E4TrlDYvGtPBg1edhkLXLEEbnfiAgAamPQ33vpW8M=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,++KRUQaY4gFsmV7egJVOWqbqbTth/oJBB7yX1HvCSL4=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,++LzeMQy/8bNUGD2K5Ms/GdiBlQ16ii82xNw3BYtOZE=,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [132]:
from sklearn.decomposition import TruncatedSVD
from scipy.sparse.linalg import svds

SVD = TruncatedSVD(n_components=12)
matrix = SVD.fit_transform(xy)
matrix.shape

ValueError: could not convert string to float: '++5bdNp/LZvGenJ8Brp4n2SfS9d4pu4qA7cF7FQW7hk='
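
The error above most likely comes from the `reset_index()` call that built `x`: it moved `uuid` out of the index and into a regular column, so `x.values` contains strings alongside the numbers. A minimal sketch of one way around this, on a toy frame, keeps the identifiers in the index before factorizing. The `toy` frame, its sizes, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Toy stand-in for x: an identifier column plus numeric per-exercise columns.
toy = pd.DataFrame(rng.random((6, 4)), columns=["c1", "c2", "c3", "c4"])
toy.insert(0, "uuid", [f"user_{i}" for i in range(6)])

# Keep identifiers in the index so that .values is purely numeric.
numeric = toy.set_index("uuid")
item_user = numeric.values.T  # shape: (n_items, n_users)

svd = TruncatedSVD(n_components=2, random_state=0)
item_factors = svd.fit_transform(item_user)

print(item_user.dtype, item_factors.shape)
```

Applied to the real table, the same idea would be `x.set_index('uuid')` (or dropping the `reset_index()` call) before taking `.values.T`.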

In [49]:
pd.crosstab([merge_log_problem_info_userdata_info_content_df.uuid,merge_log_problem_info_userdata_info_content_df.ucid], merge_log_problem_info_userdata_info_content_df.is_correct).reset_index(drop=True).index

RangeIndex(start=0, stop=784766, step=1)

In [67]:
# x1 is the (uuid, ucid) crosstab of False/True counts built earlier
y = x1.reset_index(level=[0,1])
y

is_correct,uuid,ucid,False,True
0,++5bdNp/LZvGenJ8Brp4n2SfS9d4pu4qA7cF7FQW7hk=,VY6aXT7f64ny+uy4pszHVNSy3WHyoFPuhwToxBhB3wM=,0,5
1,++E4TrlDYvGtPBg1edhkLXLEEbnfiAgAamPQ33vpW8M=,Qx6mwirYKln7CTvOXad5Do5OkVKmypYSQfFs0MB6Cvs=,3,7
2,++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,412PAnenNdYglQWXSlVtS1RYA7Yg60Wty166LCMaiHU=,0,8
3,++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,HKqPgUALqZGw984KFqAMeDwDJTi9cNhJ51UXfjzELSg=,3,6
4,++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,VFSfDdE2vLsyVfGCGYLnxYih+n8+IpUbRoiUpDJ4hc4=,0,5
...,...,...,...,...
784761,zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,EeAQRS+kGarrxqFWxO5U2lJ12zE7xgtIiHO9ojBNYpw=,0,5
784762,zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,HiffIWPAC1YJDwu8WRYatMOmxW/ufbs1/6A6HoWj1dU=,1,5
784763,zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,m6DnlS38gmb+QFf7102iqmQim9m7dKlTfZ7bPLN+Ato=,0,3
784764,zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=,oAoQyh9NT0lXlshN6sYYNSMTizpgCVqITEKcEthrDZU=,2,12


In [78]:
y[False]  # count of incorrect answers; y[0] resolves to the same column because False == 0

0         0
1         3
2         0
3         3
4         0
         ..
784761    0
784762    1
784763    0
784764    2
784765    0
Length: 784766, dtype: int64

In [84]:
y.uuid

0         ++5bdNp/LZvGenJ8Brp4n2SfS9d4pu4qA7cF7FQW7hk=
1         ++E4TrlDYvGtPBg1edhkLXLEEbnfiAgAamPQ33vpW8M=
2         ++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=
3         ++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=
4         ++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=
                              ...                     
784761    zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=
784762    zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=
784763    zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=
784764    zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=
784765    zzyukk9LNl7j/9xxgZj6roQBbGn1MjSPTCgc0r5sXpU=
Name: uuid, Length: 784766, dtype: object

In [83]:
y.columns

Index(['uuid', 'ucid', False, True], dtype='object', name='is_correct')
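
The column labels here are the booleans `False` and `True`, which is why integer keys such as `y[1]` appear to work: in Python `True == 1` and `False == 0`. Renaming them to strings is one way to make later selections less fragile, as the sketch below shows. The `counts` frame and the names `n_wrong`, `n_correct`, and `accuracy` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for y: identifier columns plus boolean-labelled count columns.
counts = pd.DataFrame(
    {"uuid": ["u1", "u2"], "ucid": ["c1", "c1"], False: [0, 3], True: [5, 7]}
)
counts.columns.name = "is_correct"

# Give the count columns explicit string names.
tidy = counts.rename(columns={False: "n_wrong", True: "n_correct"})
tidy.columns.name = None

# Accuracy per (user, exercise) pair.
tidy["accuracy"] = tidy["n_correct"] / (tidy["n_correct"] + tidy["n_wrong"])

print(tidy)
```

With string labels, a pivot can name its value column directly, for example `values="accuracy"`.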

In [85]:
z = y[1]

In [87]:
# pivot expects column labels for values, not a Series; spread both count columns
y.pivot(index='uuid', columns='ucid', values=[False, True]).fillna(0)

is_correct,False,False,False,False,False,False,False,False,False,False,...,True,True,True,True,True,True,True,True,True,True
ucid,+DlgHAr1GtoQgtGqwoen6pt4/ayVRO+rMaCVZ7jGCHE=,+IgBffWedJpxG6Zo/kHbrgIRR4jwwTwa6nV03GLwX5A=,+Lgj0Me9/8gtiruGKq8KxemAD15kU4yCfb6nteNDWjw=,+N+e7SzcUVowUo7D4udR8rBKRmR1H7yuu5Tleqlhv48=,+ZVl8HEbTM1GOoCTt2wxAfzQswvAWL3L8e5mLuyy1dY=,+aXi8dpg0URdKkJtkicic7DrTHAWPSnR9bwD+QouANE=,+c+5BQh1a2P75VjWgg/aX+j7kru6tteMWrupq+MMQgw=,+cLbiKkV7+VhNPr7OGG3B3kPpZ8er0mvFvqPbM/gm+E=,+fQqnCkVTMs8GPkdZJWUcApXC/Ea9bVtxNIblqLRie4=,+g5YlvjiVYzjFfpMkHEGCty7PGcKRuuCVDY/g8dzoD0=,...,z9WEaz3dU4bh7FPuKdAt6UrOgqpVnwWFrb/AL4dkKJs=,zFUHHn64b2GorjfAU6+BVy85oSZxzLhsrqGzKB7JwZc=,zHDjrBKb5lNSncPDp2hPXdb1zXdHj1Y/iQvzW7DTaIQ=,zHqZC5BMQzmA8ZbWd776nwinT6LWg2mxiT0WiW/5YZE=,zX5HAUpRYC4F7YPNs5tUeBqVi++rEmPUfsAtHpoZmEQ=,zcH7Yl/gFwkU2Mej9UzQgaJL66wIdiMucPZwVyeqHxc=,zh/Ht6E99DfEW1ZtC/mgD6MmKmVQEiH09UOhX+Dz3rI=,znVNbyVOWXo/XrF4bWN0DAWkAGmYu/jwUwO5BReUQxY=,zpLoCKHOugDScAlFt0XHJJKFwU+r3YZaBlTf5G7qWpE=,zxxjH0rOlRk3Fe7P3H8+00eXPXeazjvuIXf2bdEyhJo=
uuid,,,,,,,,,,,,,,,,,,,,,
++5bdNp/LZvGenJ8Brp4n2SfS9d4pu4qA7cF7FQW7hk=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
++E4TrlDYvGtPBg1edhkLXLEEbnfiAgAamPQ33vpW8M=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
++G4mkLfs4WDYhc1Ga+3G+/oqSniQQvLBm7SBQ3V39Y=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
++KRUQaY4gFsmV7egJVOWqbqbTth/oJBB7yX1HvCSL4=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
++LzeMQy/8bNUGD2K5Ms/GdiBlQ16ii82xNw3BYtOZE=,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zzi+wYqYExc64gAT8ir+SILPXPOU7MZBYthQfazp620=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zztwLsjf44Uzumou84qh04wObpugkIllLCIp/mgoMu8=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzujjvdKzb4wL8cFKdF3Of5gg2XHzIDQCZ2TfaYpSNs=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzvvUOwcSXXTXd8xoyimNcm2DjeVPLmZAI19WKfhSAM=,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
