xLearn ml-latest

An experiment applying the xLearn FM (factorization machine) model to the MovieLens dataset.

Environment set up

Clone this project

git clone --recurse-submodules git@github.com:king0980692/xLearn_ml-latest.git

Prerequisites

pip

Using venv:

python3 -m venv xlearn_ml
source xlearn_ml/bin/activate

pip install -r requirements.txt

poetry

Use Python 3.8.1:

pyenv local 3.8.1

Tell Poetry to use Python 3.8 and install the dependencies:

poetry env use python3.8
poetry install

Enter the virtual environment so that scripts can be run more simply:

poetry shell

Experiment step

The steps below use the MovieLens 100K dataset to walk through the experiment in detail.

Prepare data

mkdir -p ./data
wget https://files.grouplens.org/datasets/movielens/ml-100k.zip -P ./data
unzip ./data/ml-100k.zip -d ./data

Using encoderder to generate the sparse format data

This step creates the training data in libsvm format, along with a test file containing every user-item pair, which is used to predict a score for each pair.

python3 ./encoderder/encoderder.py -c ./100k.json

training file:

$ head ./exp/ml.train

5 1:1  944:1
3 1:1  1736:1
4 1:1  1847:1
3 1:1  1958:1
3 1:1  2069:1
5 1:1  2180:1
4 1:1  2291:1
1 1:1  2402:1
5 1:1  2513:1
3 1:1  945:1
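
The sample above is consistent with a one-hot layout in which user features come first and item features follow, and the way the item indices jump around (944, 1736, ..., 945) would fit ids being ordered as strings. The following is a minimal sketch of that layout under those assumptions, not the encoderder implementation; `build_index` and `encode_row` are hypothetical helper names and the three input rows are made up.

```python
def build_index(values, offset):
    """Assign consecutive feature indices, starting at `offset`, to the
    distinct values in string-sorted order (an assumed ordering)."""
    return {v: offset + i for i, v in enumerate(sorted(set(values)))}


def encode_row(user, item, rating, user_index, item_index):
    """Format one rating as '<label> <user_feature>:1 <item_feature>:1'."""
    return f"{rating} {user_index[user]}:1 {item_index[item]}:1"


# Made-up (user, item, rating) rows, kept as strings as read from a TSV file.
rows = [("1", "1", "5"), ("1", "2", "3"), ("2", "10", "4")]

user_index = build_index([u for u, _, _ in rows], offset=1)
# Item features start right after the last user feature (assumption).
item_index = build_index([i for _, i, _ in rows], offset=len(user_index) + 1)

lines = [encode_row(u, i, r, user_index, item_index) for u, i, r in rows]
print("\n".join(lines))
```

Because the ids are sorted as strings in this sketch, item "10" receives a smaller feature index than item "2".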

all_pair file:

$ head ./exp/ml.test.all_pair
1:1 944:1
1:1 945:1
1:1 946:1
1:1 947:1
1:1 948:1
1:1 949:1
1:1 950:1
1:1 951:1
1:1 952:1
1:1 953:1
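
The all-pairs file lists every user feature against every item feature, without a label. A minimal sketch of generating such lines, assuming 1-based user features followed directly by item features; `all_pair_lines` is a hypothetical helper and the counts are toy values, not the real dataset sizes.

```python
from itertools import product


def all_pair_lines(n_users, n_items):
    """Yield one unlabeled libsvm-style line per (user, item) combination.

    User features are assumed to occupy 1..n_users and item features
    n_users+1..n_users+n_items.
    """
    user_features = range(1, n_users + 1)
    item_features = range(n_users + 1, n_users + n_items + 1)
    for u, i in product(user_features, item_features):
        yield f"{u}:1 {i}:1"


lines = list(all_pair_lines(n_users=2, n_items=3))
print(lines[0])
```

The file grows as users times items, so for larger datasets it may be worth writing the lines to disk as they are generated rather than collecting them in a list.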

100k.json

This JSON file tells encoderder how to generate the sparse-format dataset. It looks like this:

"train": {
    "input": "./data/ml-100k/ua.base",
    "output": "./exp/ml.train",
    "cached": true,
    "seperator": "\t",
    "header": false,
    "sparse": false,
    "target_columns": [
        {
            "index": 0,
            "type": "cat"
        },
        {
            "index": 1,
            "type": "cat"
        },
        {
            "index": 2,
            "type": "truth"
        }
    ]
}

A few points are worth explaining:

input : the input file to convert to the sparse format
output : the path of the generated file
target_columns : the columns to encode, each with its column type:
- cat : categorical data
- num : numerical data
- truth : the label
other options : see the encoderder repository for more information
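
To make the roles of the column types concrete, the sketch below reads a config shaped like the one above and splits one tab-separated line into feature fields and a label. It only illustrates the idea; encoderder's real parsing logic may differ, and `split_fields` is a hypothetical helper. The sample line is a made-up record with user, item, rating and timestamp columns.

```python
import json

# Trimmed copy of the "train" entry, wrapped in braces to form valid JSON.
# The raw string keeps "\t" as a JSON escape instead of a literal tab.
config = json.loads(r"""
{
  "train": {
    "seperator": "\t",
    "target_columns": [
      {"index": 0, "type": "cat"},
      {"index": 1, "type": "cat"},
      {"index": 2, "type": "truth"}
    ]
  }
}
""")


def split_fields(line, section):
    """Return (feature values, label) for one input line."""
    parts = line.rstrip("\n").split(section["seperator"])
    features, label = [], None
    for col in section["target_columns"]:
        value = parts[col["index"]]
        if col["type"] == "truth":
            label = value
        else:  # "cat" and "num" columns are both treated as features here
            features.append(value)
    return features, label


features, label = split_fields("1\t1\t5\t874965758\n", config["train"])
print(features, label)
```

Columns that are not listed in `target_columns`, such as the timestamp in this sample line, are simply ignored by the sketch.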

Training and Testing

Use the training file above to train the model and predict a score for every pair:

python3 train_predict.py --train ./exp/ml.train --test ./exp/ml.test.all_pair --output ./result/output.txt
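
For context, a second-order factorization machine scores a feature vector x as w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j. The sketch below evaluates that formula for one line of the all-pairs file in plain Python. It does not show how train_predict.py calls xLearn, and the parameter values are invented for illustration; `parse_features` and `fm_score` are hypothetical helpers.

```python
def parse_features(line):
    """Parse 'idx:val idx:val ...' into a list of (index, value) pairs."""
    return [(int(i), float(x)) for i, x in (tok.split(":") for tok in line.split())]


def fm_score(features, w0, w, v):
    """Second-order FM score, using the identity
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)."""
    linear = w0 + sum(w[i] * x for i, x in features)
    k = len(next(iter(v.values())))  # number of latent factors
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * x for i, x in features)
        s_sq = sum((v[i][f] * x) ** 2 for i, x in features)
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


# Invented parameters for one user feature (1) and one item feature (944).
w0 = 0.5
w = {1: 0.1, 944: 0.2}
v = {1: [1.0, 2.0], 944: [0.5, -1.0]}

score = fm_score(parse_features("1:1 944:1"), w0, w, v)
print(round(score, 6))
```

With exactly one active user feature and one active item feature, the score reduces to w0 + w_user + w_item + <v_user, v_item>.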

Generate the user pred pickle

Generate per-user predictions for the users in the test file:

python3 gen_user_pred.py --score_file ./result/output.txt --truth_file ./exp/ml.test.all_pair

The command above writes a pickle file to ./result/user_pred.pkl containing a Python dict that collects every user's predictions. Its contents look like this:

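import pickle

# Load the dict written by gen_user_pred.py
with open("./result/user_pred.pkl", "rb") as f:
    user_pred = pickle.load(f)

print(user_pred[1])

'''
The output will look like:
[('1101',4.78123),('312',4.18312),....]
'''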
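
A minimal sketch of how such a dictionary could be assembled, assuming the score file holds one score per line in the same order as the all-pairs file, and that a mapping from feature index back to the original id is available. `group_by_user`, `feature_to_id` and the toy values are hypothetical; gen_user_pred.py may work differently.

```python
from collections import defaultdict


def group_by_user(pair_lines, scores, feature_to_id):
    """Collect (item_id, score) tuples per user, highest score first."""
    user_pred = defaultdict(list)
    for line, score in zip(pair_lines, scores):
        user_tok, item_tok = line.split()
        user = feature_to_id[int(user_tok.split(":")[0])]
        item = feature_to_id[int(item_tok.split(":")[0])]
        user_pred[user].append((item, score))
    for preds in user_pred.values():
        preds.sort(key=lambda p: p[1], reverse=True)
    return dict(user_pred)


# Toy inputs: two user features (1, 2) and two item features (3, 4).
pair_lines = ["1:1 3:1", "1:1 4:1", "2:1 3:1"]
scores = [3.2, 4.7, 2.5]
feature_to_id = {1: 1, 2: 2, 3: "10", 4: "20"}

user_pred = group_by_user(pair_lines, scores, feature_to_id)
print(user_pred[1])
```

Sorting each user's list once at build time means the evaluation step can take the top k entries with a simple slice.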

Evaluation

Finally, read the pickle file and evaluate the predictions against the ground-truth file:

python3 eval.py --predict ./result/user_pred_dict.pkl --truth ./exp/ml.test
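
MAP@10 averages, over users, the average precision of each user's top 10 predictions. The sketch below uses one common definition that normalizes by min(k, number of relevant items); whether eval.py uses the same normalization is an assumption. `average_precision_at_k` and `map_at_k` are hypothetical helpers, and the data is made up.

```python
def average_precision_at_k(ranked_items, relevant, k=10):
    """AP@k for one user: mean of precision@rank over the hit positions."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / min(k, len(relevant))


def map_at_k(user_pred, user_truth, k=10):
    """Mean of AP@k over the users that appear in user_truth."""
    aps = [
        average_precision_at_k([item for item, _ in user_pred.get(u, [])], rel, k)
        for u, rel in user_truth.items()
    ]
    return sum(aps) / len(aps) if aps else 0.0


# Toy predictions (already sorted by score) and ground-truth item sets.
user_pred = {1: [("a", 4.9), ("b", 4.1), ("c", 3.8)], 2: [("x", 4.0), ("y", 3.0)]}
user_truth = {1: {"a", "c"}, 2: {"z"}}

score = map_at_k(user_pred, user_truth, k=10)
print(round(score, 4))
```

Here user 1 has hits at ranks 1 and 3, giving (1/1 + 2/3) / 2, while user 2 has no hits, so the mean is 5/12.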

Performance

Dataset	MAP@10
ml-latest	0.006496
ml-100k	0.0036496
ml-10m	0.0023612
