[Question]: Reproduce LLMLingua-2 results with Mistral-7B #155

Open
xvyaward opened this issue May 21, 2024 · 4 comments

@xvyaward

Describe the issue

First of all, thank you for your great contributions.

I have a question similar to issue #146: I cannot reproduce the Table 4 results in the LLMLingua-2 paper.

compress model: microsoft/llmlingua-2-xlm-roberta-large-meetingbank (downloaded from hf)
llm: mistralai/Mistral-7B-v0.1 (also downloaded from HF, not an instruction-tuned model)
Hardware platform: 1 Nvidia A100-80GB

Here are some results from the paper and my reproduced scores:

| Method | MeetingBank QA | MeetingBank summary | LongBench (2000 token) avg. | narrativeqa | multifieldqa_en | multifieldqa_zh | qasper |
|---|---|---|---|---|---|---|---|
| LLMLingua-2 | 76.22 | 30.18 | 26.8 | | | | |
| Original prompt | 66.95 | 26.26 | 24.5 | | | | |
| LLMLingua-2 reproduced | 73.59 | 29.95 | 25.65 | 10.07 | 36.61 | 26.47 | 29.46 |
| Original prompt reproduced | 66.05 | 26.89 | 26.47 | 10.05 | 38.7 | 31.46 | 25.67 |

I'm not sure whether I should include multifieldqa_zh for calculating the average of LongBench singledoc QA scores, but even excluding it gives an inconsistent average score.
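To make the averaging question concrete, here is a minimal sketch that recomputes the LongBench single-doc QA average from the reproduced per-task scores quoted above, with and without multifieldqa_zh. The helper name `average` and the dictionary layout are illustrative only and are not taken from the LLMLingua evaluation scripts.

```python
# Per-task scores copied from the two "reproduced" rows reported in this issue.
reproduced = {
    "LLMLingua-2": {
        "narrativeqa": 10.07,
        "multifieldqa_en": 36.61,
        "multifieldqa_zh": 26.47,
        "qasper": 29.46,
    },
    "Original prompt": {
        "narrativeqa": 10.05,
        "multifieldqa_en": 38.7,
        "multifieldqa_zh": 31.46,
        "qasper": 25.67,
    },
}


def average(scores, exclude=()):
    """Unweighted mean over tasks, skipping any task named in `exclude`."""
    kept = [value for task, value in scores.items() if task not in exclude]
    return round(sum(kept) / len(kept), 2)


for name, scores in reproduced.items():
    with_zh = average(scores)
    without_zh = average(scores, exclude=("multifieldqa_zh",))
    print(f"{name}: with zh = {with_zh}, without zh = {without_zh}")
```

With all four tasks the means are 25.65 and 26.47, which match the reproduced averages reported above. Without multifieldqa_zh they become 25.38 and 24.81, against 26.8 and 24.5 in the paper.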

Here is the example process that I followed for MeetingBank QA evaluation.

  1. I made meetingbank_test_3qa_pairs_summary_formated.json by modifying format_data.py.
  2. I generated compressed_prompt using
python compress.py --load_origin_from ../../../results/meetingbank/origin/meetingbank_test_3qa_pairs_summary_formated.json \
    --model_name microsoft/llmlingua-2-xlm-roberta-large-meetingbank \
    --compression_rate 0.33 \
    --force_tokens "\n,?,!,." \
    --save_path ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json
  3. I evaluated with
python eval_meetingbank_qa_local_llm.py --load_prompt_from ../../../results/meetingbank/llmlingua2/compression_ratio33_meetingbank_test_3qa_pairs_summary_formated.json \
    --load_key compressed_prompt \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --save_path ../../../results/meetingbank/llmlingua2/mistral_7b/answer_ratio33_meetingbank_test_3qa_pairs_summary_formated.json

I created eval_meetingbank_qa_local_llm.py by modifying eval_meetingbank_qa.py so that it runs the local Hugging Face Mistral-7B model through vLLM.
If there is no problem with this reproduction process, could you share the code you used for evaluation with Mistral-7B?
Thank you for reading.

@iofu728
Contributor

iofu728 commented May 24, 2024

Hi @xvyaward, thanks for your support of LLMLingua-2 and for sharing such detailed results. These results seem quite good and are generally similar to ours. Could you confirm which specific metric you are most concerned about not meeting your expectations?

@pzs19
Contributor

pzs19 commented May 24, 2024

Hi @xvyaward, thanks for your interest and the very detailed description.

  1. multifieldqa_zh should be excluded here. We evaluated LLMLingua-2 on Chinese in a separate experiment; please refer to Table 9 of our paper for those results.

  2. Could you please share more information on how you use the Mistral model for inference? Sampling parameters and evaluation strategies, such as the temperature and whether the answer is truncated when "\n" appears, can affect the overall performance.

For our experiment, we used the official Mistral GitHub repo for inference and downloaded the model from mistralcdn.
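As an illustration of the truncation choice mentioned above, the following minimal sketch cuts a generated answer at the first "\n". The function name `truncate_answer` is hypothetical, and stripping leading whitespace before the cut is an assumption; the actual evaluation script may handle this differently.

```python
def truncate_answer(text: str, stop: str = "\n") -> str:
    """Return the text before the first `stop` string.

    Leading whitespace is dropped first (an assumption), so an answer that
    begins with a newline is not reduced to an empty string by the cut itself.
    """
    text = text.lstrip()
    return text.split(stop, 1)[0].strip()


if __name__ == "__main__":
    raw = " 2016 0392\nQuestion: What is the level of service?\nAnswer: B"
    print(repr(truncate_answer(raw)))
    # An answer made only of newlines stays empty after truncation.
    print(repr(truncate_answer("\n\n\n\n")))
```

Whether truncation is applied, and whether it happens before or after whitespace stripping, changes which text is scored, so it is worth stating explicitly when comparing numbers.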

Hope these explanations can help you.

@mzf666

mzf666 commented Jul 24, 2024

Thanks for sharing your issue. Could you explain how you modified format_data.py to obtain meetingbank_test_3qa_pairs_summary_formated.json? I am not sure how to carry out this step.

@cornzz

cornzz commented Sep 3, 2024

Edit: TL;DR: the model (downloaded from mistralcdn and run with the mistral-inference library) starts generating nonsense (see below) at sequence lengths of 2000-2300 tokens, which is far below the theoretical context window. Using the Hugging Face version, I roughly reproduce @xvyaward's results, but even then it starts doing the same at ~4k tokens.
My main question: which exact revision of the inference library did you use, and what n_max_tokens value was used to truncate the prompts?


Hi @pzs19, I am also currently trying to reproduce the results using mistral 7b v0.1 downloaded from mistral cdn, using the mistralai/mistral-inference repository. My results are currently not even close to the results from the paper (less than 40) so I assume I must be doing something wrong.

I am running on an A100 40GB. This is my code, roughly:

Code
model = Transformer.from_folder("models/mistral-7B-v0.1", device="cuda:7")
tokenizer = MistralTokenizer.from_file("models/mistral-7B-v0.1/tokenizer.model").instruct_tokenizer.tokenizer

tokens = tokenizer.encode(prompt, bos=True, eos=False)
out_tokens, logprobs = generate([tokens], model, max_tokens=100, temperature=0)

I am not setting eos_id as the provided evaluation script takes care of truncating the answer. I am also using the prompt template provided in eval_meetingbank_qa.py: "Write a high-quality answer for the given question using the provided meeting transcript (which may be compressed).\n{transcript}\nQuestion:{question}\nAnswer:"

One problem I ran into is that the model often replies with nothing but a run of newline characters. I believe this happens when the input prompt exceeds a certain length:

Example of bad answers
    "3": {
        "transcript": "proposed ordinance 2016 0392. transportation concurrency. Transportation Department medal halo. Mr. Carlson begin briefing 2006 0392?. proposed ordinance county transportation concurrency program unincorporated area. modifies King County Code concurrency approves concurrency travel map test results map. concurrency. brief outline. Jay Osborne Rhodes. two members Transportation Concurrency Expert Review Panel. concurrency language King County Comprehensive Plan transportation chapter sets requirements concurrency program. establishes service standards land use areas. rural area service. requires concurrency travel sheds testing traffic flow arterials. committee assistant call maps. current law 25 travel codes boundaries map. travel sheds arterials test travel speeds. two years. data map travel sheds close development concurrency. 15% miles meet standard. next.switch next without. next slide?. Christchurch travel sheds closed five of 25 congestion afternoon peak. exceeded standards. proposed change in 392 changes. travel shed boundaries. map new boundaries. reflect changes unincorporated area. separates urban travel shields. urban unincorporated travel shelters littered. new rural travel sheds numbered larger reflect annexations. logical configuration roads arterial test performed. new data local firm travel speeds. thorough process evaluating travel time old practice. better picture actual travel times. change ordinance current system state routes used in concurrency comp plan policy. new proposal not use state routes with county owned arterial routes. complicated project suggest Jay come up. questions explaining difficult. councilmember. Dombroski has question.Chair Paul for work on approach to concurrency. last issue current plan permits travel time analysis state roads integral part of world transportation system. rationale for not using them? average driver distinguish state county road. certain state roads. 
not statewide significant like freeways. roads similar county arterials service standard set. Puget Sound Regional Council out of control. decision on roads within county's jurisdiction control. data gathered for state routes. one difference. using state roads level service standard development restricted. Jay Osborne deputy director roads. state highways 900. level service D in rural area counties level B state routes against counties level B pass meet state level D concurrency test.complications test state routes at our standard not states't meet standard state standard passing concurrency testing state routes in rural area. implication of policy choice to property build homes. In rural area? using test proposed pass ability. Councilmember Lambert's district. used state standards roads.?. might close travel show? use county stamp. Staff applied county to state. Thank you Madam Chair. for patience. Councilmember speak in detail. councilmember judging concurrency at intersections. new to me. lay out calculated. causes failures average of roads one road failing doesn mean whole travel failing. if change made impact more congestion development. reasonable reaction. lower standards for free flowing traffic for development energy. many ways to test concurrency.counties methodology actual travel time staff in cars with stopwatches driving novelty road. between three and 6 p. m. spring school not spring break. people standing on overpasses with stopwatch. whole length of roads. arterial roads tested. rule 85% meet standard for pass concurrency. state routes test at counties level traffic don't meet level service. grad students U. Dub counties level of service B highest. aspirational level service. state D on routes different standard testing creates complications. added state route testing data for 24, month testing every day Monday through Friday. use Tuesday, Wednesday Thursday data. data points test traffic index data. based on cell phones. 
more cost effective than paying staff testing areas staff drive roads planning group.rural area key zoning code with development small for impacts on county road system development cars on road in open sheds. impact. climate should be thought. bigger than county dealing with unincorporated areas. regionally consistent concurrency. differently in rural urban areas. Seattle cities like Bellevue different. hard to common vocabulary for public around or roads. driven transportation policy by individual anecdotes experience. important commutes. better if systematic clear way whole system. wish including throughput calculations not just vehicles important. many people move through points from A to B arterial well served by transit more people than not well-served. thank you for listening. next work with colleagues appreciate work. sense splitting urban from rural sense. Thank you.Councilmember been here years ago old currency plan 300 360 boxes? nightmare. hired national firm feedback. worst plan seen country. man said 30 years experience never seen like it. ditched afterwards. new improved. jurisdiction over Bellevue not mayor. unless through transportation committee. RC to make changes. people driving from rural into unincorporated areas one level service another. drivers commute traffic. one level of service aspirational artificial barrier. urban growth boundary line road from Senate line to rural area one level service center line to urban line another? years ago. fixed it. problem same road two levels service. makes easier. consistent with other roads county. much development left in county other than under Growth Management Act.Jay said ability deal with roads control over. final thing thank you. promise. map circles inside travel sheds urban islands in unincorporated areas. talk consternation about growth targets. map demonstrates debate one sided. need to grow. growing. large growth urban areas travel suffer growth. 
requiring free flowing traffic for development inner suburbs urban. demonstrates complexities debate. debates affordable housing lifestyles rural area cities growing. housing choice different housing choices. complicated issue.? page 46 packet list root segments failed analysis. miniatures travel total mileage travel shed. 85% or more passes test. less than 85% fails travel should failing one travel should fails new process. Jay alluded development in rural areas travel should fail provision for minor public educational developments proceed.concurrency system allowed form changed. section code amended proposed ordinance. Ten 7285. Lists minor developments schools uses forward if travel fails. important family parcel subdivide build house. county modified program accommodate complies zoning. concurrency first step developing. consistent with zoning. last map results. test results map. red arrow shows development mostly agricultural production district areas developed. APD parcels minimum ten acres not traffic. closed?. page 46 seven two. small small mileage two road segments half mile fail. odd area agricultural uses. 72nd to 77 main drag across valley through APD four lanes urban level tested service fee area rural. urban road tested rural level fails. why testing rural area APD?.? council member.looked concurrency issues year ago parked open issues testing done work. concern duties regional consistency narrow basis. interested in travel sheds urban rural line. East Renton Plateau question about whose standard?. look at adjacent city standard urban side line account for city planning policies zoning traffic standards. takeaway move to not cross urban rural line in travel sheds. travel sheds on centerline line one standards urban center line different step. question urban side city paid for standards into service standards. history of concurrency agreements with cities development. 
economy suffered 2008, four cities withdrew agreements develop areas impact mitigation payment system money development. complicated. conversations with cities about standards annex Kahani Amish acquire Fall City Road developing.currently agreements to model concurrency in areas remaining urban areas small chipped away. county state smallest unincorporated area. 12 and a half percent county unincorporated. Snohomish County has 28% in forties and above. done Growth Management Act Incorporated. talking about 12%. rural. Unincorporated. 12% of county. landmass bigger population about 300,000 out of two. 250. half percent lower than. land allocated to one five ten 20 acre parcels. aspirational level of B people off B road into city D or F rating. mile need realistic about level people land children live on property. allocations for. will of body? need 30 day period for public testimony. 30 day advertising period.vote committee or recommendation? amendment technical cleanup. policy changes? yes no. Amendment one slightly different packet. technical changes. spelling errors. new sentence section eight.?. section four ordinance online nine amendment except KCC 1470 285. technical clarification minor use covered by 285. not changing practice typographical change. line 13 Section 1470 L last item minor developments travel sheds. rewording clarity. line 18 property not subdivided last ten years. short subdivision rural travel owner subdivide. family method. current law applicant owned property five years not subdivided ten years allowed meets zoning requirements no need purchase transferrable development rights. executive proposal no subdivision ten years five years. rural policy 3 to plan ten years requirement.maintaining existing language for ten years not moving forward with change executive proposed. comprehensive policy language mandatory.? executive okay with revised amendment? Yes. preferred old new passed out. plan in four years reconsider. Councilmember Balducci. 
questions answered. Councilmember Dunn. issue worked on long time.? not understood by elected officials. changing modifying expanding travel sheds not changing methodology for concurrency standards.? methodology service. Level comprehensive plan establishes service standards for urban E rural B rural town centers D neighborhood centers D. change service standards latest update comp plan. remain red, yellow green map for concurrence.?.? modify mapping. moved to Travel Shed concept colors abandoned. need. homes. Condos low income housing places for young families everyone.with pro-development situation developing in earmuffs.? pro-development. bad. didn't win. question south of Issaquah Hobart Road road bad lost in traffic unhappy citizens. People can't out. Emergency services. Ambulances fire trucks. county refuses increase capacity. not development travel shed. status of Esquire open road travel 12. travel schedules new travel open. segment of Issaquah Hobart Road between Issaquah City Limits Southeast 127th Street. development permitted. total results travel not hit 15% or more mileage standards. never believed currency control land use planning development. wrong way. through zoning permitting issues. problem Hobart Road. willingness to increase capacity. money to increase capacity.Maybe little bit both point out. large developments RFI zoning add to problem disastrous. more statement than question. wonder broader travel sheds right way. policy not object against drill down further future. renditions concurrency from obnoxious. need Ph. D. simplified. growth small compared to. clarify. 12% population county unincorporated area. Half areas be annexed. 6% rural unincorporated. not lot. people need to comment important things studied.?. council appointed Transportation Concurrency Expert Review panel review work comment letter. represented from development community environmental community. 
citizen of unincorporated area representative Non-Motorized users bus and transit. Transportation Concurrency Expert Review Panel decided time end methodology development concurrency support.1:00 scholar my right chair of Transportation Concurrency Expert Review Panel remarks. Good morning council members thank you for time. long standing Martin here back her attendance conveys reflection panel committed long standing involvement with staff older approaches to concurrency approach materials. significant influenced by collaboration among diverse interests unanimous recommendation despite diversity interests panel. panel worked together many years. I recent addition knew background honored to taken over chairmanship years ago. proud of work well-educated staff master's program materials presented past year. staff selfless. interested in good of county system. data reflective of thoughtful work. panel felt well served by consensus well-educated group individuals succinctly convey information candid spirited dialog. feel good about what presented.sad about dissolving few fora talk candidly without interests good quality conversations. makes sense to dissolve. honored to served county. thank you for opportunity hope continue service in individual capacities. thank you complicated formula impactful. unanimous decision appreciated. thank for service. believe lack of growth no need to continue committee evaluate. panel believes travel sheds reformatted annexation processes mechanized methodology through INRIX need for panel to review aspects rote. don't need staff time materials deliverable. future comfortable with decision. thank for of assets like INRIX. other committees data helpful. before US Council member. move approval of proposed ordinance number 2016? Dash 0392 do pass recommendation. Thank you. questions or comments before vote?. Councilmember and Ambassador.offer Amendment one. question before vote final. explained by staff. exact name on. prefer changed. 
correctional errors typos. clarifications King County code. five ten year issue. passed comp plan voluminous change need flag three years ten months change back. code consistent good. favor presented staff say I opposed name passed amended version of 2016 0392. Council Member Dombroski comment. follow up Councilmember Dunn's inquiry capacity funding. project mitigation money? current provision spent within travel? SIPA money spending on roadway. specific projects travel. good clarification question. favor vote from clerk's office? Councilmember Baldacci. Gossage. Colwell. McDermott. group. vote six days zero no's councilmembers Gossett McDermott moderate.excused. want consent or talk again? needs put Thursday? Public comment. not enough. Thanks pointing 30 day advertising period. not regular schedule wait after 30 days end of February. no other business meeting adjourned. Thank you.",
        "questions": [
            "What is the proposed ordinance number discussed in the meeting?",
            "What is the level of service for roads in the rural area?",
            "Who had a question during the meeting?"
        ],
        "answers": [
            "2016 0392",
            "B",
            "Councilmember Dombroski"
        ],
        "model_answers": [
            "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n",
            "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n",
            "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
        ]
    },

Could you provide more detailed information on how you performed inference using mistral 7b? For example, what was the n_max_token parameter used and which version of the mistral-inference library was used?

Edit: I investigated a bit further. The response quality starts to deteriorate heavily at a prompt length of around 2000 tokens, and beyond around 2300 tokens the answers consist exclusively of newline characters. I don't understand why, since the context window is supposedly 8k.
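One way to pin down where the degradation starts is to bucket answers by prompt length and measure how many are empty after stripping whitespace. This is only a sketch: the function names and the `(n_prompt_tokens, answer)` record layout are hypothetical, and the sample records are made up for illustration rather than taken from a real run.

```python
def is_degenerate(answer: str) -> bool:
    """True when the answer contains nothing but whitespace, e.g. only newlines."""
    return answer.strip() == ""


def degenerate_rate_by_length(records, bucket_size=500):
    """Group (n_prompt_tokens, answer) pairs into length buckets.

    Returns {bucket_start: fraction of degenerate answers in that bucket}.
    """
    totals, bad = {}, {}
    for n_tokens, answer in records:
        bucket = (n_tokens // bucket_size) * bucket_size
        totals[bucket] = totals.get(bucket, 0) + 1
        bad[bucket] = bad.get(bucket, 0) + int(is_degenerate(answer))
    return {bucket: bad[bucket] / totals[bucket] for bucket in sorted(totals)}


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [
        (1800, "B"),
        (2100, "\n\n\n"),
        (2200, "Councilmember Dombroski"),
        (2400, "\n\n\n\n"),
    ]
    for bucket, rate in degenerate_rate_by_length(sample).items():
        print(f"{bucket}-{bucket + 499} tokens: {rate:.2f} degenerate")
```

Running the same tally on outputs from both the mistral-inference and Hugging Face setups would show whether the threshold moves with the inference stack.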

Edit 2: Is the prompt template from eval_meetingbank_qa.py exactly the one used in your evaluation? I am asking because it is missing a space after "Answer:", which sometimes causes the model to ignore the question, whereas it answers the question if a space is added. This makes me wonder whether it affects the benchmark score considerably.
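To compare the two variants directly, here is a small sketch that builds the prompt from the template quoted earlier in this comment, with an optional trailing space after "Answer:". The `build_prompt` helper and its `trailing_space` flag are illustrative names, not part of eval_meetingbank_qa.py.

```python
# Template text as quoted from eval_meetingbank_qa.py earlier in this comment.
TEMPLATE = (
    "Write a high-quality answer for the given question using the provided "
    "meeting transcript (which may be compressed).\n{transcript}\n"
    "Question:{question}\nAnswer:"
)


def build_prompt(transcript: str, question: str, trailing_space: bool = False) -> str:
    """Fill the template; optionally append one space after 'Answer:'."""
    prompt = TEMPLATE.format(transcript=transcript, question=question)
    return prompt + " " if trailing_space else prompt


if __name__ == "__main__":
    question = "What is the proposed ordinance number discussed in the meeting?"
    plain = build_prompt("proposed ordinance 2016 0392.", question)
    spaced = build_prompt("proposed ordinance 2016 0392.", question, trailing_space=True)
    print(repr(plain[-8:]))
    print(repr(spaced[-8:]))
```

Scoring the same set of questions with both variants would show how much of the gap, if any, comes from this one character.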

Edit 3: I suspect that either I am using the mistral-inference library incorrectly, or it has changed considerably since you used it for your evaluation. If I run the same model through Hugging Face it performs very well, with results matching @xvyaward's. Any information on your setup would be very helpful, especially the version or revision of the mistral-inference repo.
