Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python Matching Script and Associated Changes #100

Merged
merged 4 commits into from Jun 14, 2017

Conversation

andersonfrailey
Copy link
Collaborator

This PR is the culmination of the work discussed in #92. It removes all of the SAS Matching scripts and replaces them with python scripts.

Instructions for how to run the scripts can be found in the updated README.md file.

It also updates stage2/3.py and the stage 2 notebook so that they can be used with the Python produced matched file.

@martinholmer @MattHJensen @Amy-Xu @hdoupe

@martinholmer
Copy link
Contributor

martinholmer commented Jun 13, 2017

@andersonfrailey, taxdata pull request #100 is excellent work. Thanks for your efforts on this complicated task.

@martinholmer
Copy link
Contributor

@andersonfrailey, I have a couple of questions about taxdata pull request #100.

  1. I'm assuming the following is true: "if Python Matching Script and Associated Changes #100 is merged, then executing ./csvmake puf 3 will produce the exact same puf_weights.csv and puf_ratios.csv files as included in Python Matching Script and Associated Changes #100. Right?

  2. Looks like the following code in the match function of puf_data/StatMatch/Matching/runmatch.py implies that the age_consistency function in the puf_data/finalprep.py file should be removed. Right? (If correct, I'll be happy to do that.)

    print('Creating final file')
    cpsrets = add_cps(filers, match, puf)
    cps_matched = add_nonfiler(cpsrets, nonfilers)
    # Rename variables for use in PUF data prep
    renames = {'icps1': 'age_head',
               'icps2': 'age_spouse',
               'wasp': 'wage_head',
               'wass': 'wage_spouse'}
    cps_matched = cps_matched.rename(columns=renames)

    return cps_matched
  1. Looking ahead, I'm wondering why we are using the cylp package instead of the already installed scipy package to do linear programming? The linear programming capability we need for stage2 seems to be available in scipy.optimize.linprog. The big advantage of scipy is that it comes as part of the Anaconda Python distribution. While I can't even figure out how to install cylp: the command conda install doesn't work and the command pip install doesn't work. Shouldn't we be migrating from cylp to scipy for our linear programming capabilities?

@MattHJensen @Amy-Xu

@hdoupe
Copy link
Collaborator

hdoupe commented Jun 14, 2017

Looking ahead, I'm wondering why we are using the cylp package instead of the already installed scipy package to do linear programming? The linear programming capability we need for stage2 seems to be available in scipy.optimize.linprog. The big advantage of scipy is that it comes as part of the Anaconda Python distribution. While I can't even figure out how to install cylp: the command conda install doesn't work and the command pip install doesn't work. Shouldn't we be migrating from cylp to scipy for our linear programming capabilities?

@martinholmer, I looked into this. Here's the email I sent @andersonfrailey with my conclusions:

"After looking into SciPy’s linsolve, it doesn’t look very reliable. This pull-request from March 2017 proposes a completely different solver than the one implemented now. It also points out a bunch of places where the current SciPy solver performs sub optimally. So, if we use SciPy, we would have to carefully inspect the results to make sure that they are accurate.

On the other hand, according to this paper by a government contractor, CLP is the best open source linear program solver. And CyLp is a Python interface to this package. The paper was written in 2013, but I think the findings are still relevant. The downsides of CyLp are that it is not Python 3+ compatible and it is not being actively maintained. Although there is some activity on its github page, there does not appear to be any major development.

For now, I think that we should keep using CyLp. It’s just more reliable right now. We should keep an eye on the development of Scipy’s linsolve. I would much rather be in the SciPy/NumPy sphere."

Here's an additional link to a thread on the lin_solve package development on the scipy mailing list.

@andersonfrailey
Copy link
Collaborator Author

@martinholmer

I'm assuming the following is true: "if #100 is merged, then executing ./csvmake puf 3 will produce the exact same puf_weights.csv and puf_ratios.csv files as included in #100. Right?

Correct.

Looks like the following code in the match function of puf_data/StatMatch/Matching/runmatch.py implies that the age_consistency function in the puf_data/finalprep.py file should be removed. Right? (If correct, I'll be happy to do that.)

I believe we should keep the age_consistency function. I added a more detailed explanation in issue #92.

Looking ahead, I'm wondering why we are using the cylp package instead of the already installed scipy package to do linear programming? The linear programming capability we need for stage2 seems to be available in scipy.optimize.linprog. The big advantage of scipy is that it comes as part of the Anaconda Python distribution. While I can't even figure out how to install cylp: the command conda install doesn't work and the command pip install doesn't work. Shouldn't we be migrating from cylp to scipy for our linear programming capabilities?

@hdoupe and I have talked about this recently. While I agree it would be preferable to use the scipy package, after @hdoupe did some digging into the package, we came to the conclusion it wouldn't be an improvement over CyLp yet. Hopefully sometime down the road it will be better suited for our needs.

@hdoupe
Copy link
Collaborator

hdoupe commented Jun 14, 2017

While I can't even figure out how to install cylp: the command conda install doesn't work and the command pip install doesn't work.

@martinholmer, setting up cylp is a pain. I had to install Python 2.7, and then easy_install cylp worked.

@Amy-Xu
Copy link
Member

Amy-Xu commented Jun 14, 2017

@martinholmer said

Looking ahead, I'm wondering why we are using the cylp package instead of the already installed scipy package to do linear programming? The linear programming capability we need for stage2 seems to be available in scipy.optimize.linprog. The big advantage of scipy is that it comes as part of the Anaconda Python distribution. While I can't even figure out how to install cylp: the command conda install doesn't work and the command pip install doesn't work. Shouldn't we be migrating from cylp to scipy for our linear programming capabilities?

I totally agree we should make this process easier for users. At the beginning there wasn't a very specific reason to use cylp -- simply it was the only LP solver in python I made work with a large equation set like the one we have. I'm happy to offer help if anyone has the bandwidth to revamp the code for scipy.

@andersonfrailey @hdoupe

@martinholmer
Copy link
Contributor

martinholmer commented Jun 14, 2017

@hdoupe said:

setting up cylp is a pain. I had to install Python 2.7, and then easy_install cylp worked.

Thanks for the tip on installing the cylp package. I'll try easy_install.
That worked and I was able to check #100 on my own computer.

@martinholmer
Copy link
Contributor

@andersonfrailey,
Pull request #100 will not run (that is, ./csvmake puf 3 fails) because recent updates in the puf_data/finalprep.py file have not been incorporated in #100.

Please do something like this:

git checkout master
./gitsync
git checkout pythonpuf
git merge master
git push origin pythonpuf

Meanwhile I'm running ./csvmake puf 3 on my computer as a final test of #100.

@MattHJensen @hdoupe

@martinholmer martinholmer merged commit 5c3273f into PSLmodels:master Jun 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants