In [9]:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from io import BytesIO

import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pinecone
from sentence_transformers import SentenceTransformer
import pandas as pd
import torch
import shutil

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
DATA_DIR = "data"

## Data Collection

### Initial Realization

Our initial strategy was to find an existing dataset for our use case, but an extensive search turned up nothing suitable. We then considered web scraping. After reviewing both the technical challenges and the legal implications, we decided against scraping third-party websites because their terms of service explicitly restrict it. For example:

- **Intuit (TurboTax):**  
  Intuit’s TurboTax terms explicitly prohibit any “unauthorized access” to their platform, including scraping or downloading content that isn’t owned by the user. This means using bots or scripts to extract data without proper permission is not allowed (https://www.intuit.com/legal/terms/en-us/turbotax/online-license/).

- **H&R Block:**  
  H&R Block’s online services agreement similarly bans any automated data extraction. Users are not permitted to harvest, scrape, or use any automated means (such as bots or spiders) to access their software and services or extract data, especially personal information, without consent (https://assist.hrblockonline.ca/TermsAndConditions.html#:~:text=If%20you%20breach%20any%20of,may%20be%20available%20to%20Block).

- **TaxAct:**  
  TaxAct’s Terms of Service include clear restrictions against using data mining, robots, artificial intelligence, or similar methods to extract data from their website. Their services are intended solely for personal tax preparation, and automated extraction of content such as Q&A is considered a breach of these terms (https://www.taxact.com/terms-of-service).

### Synthetic Data vs. RAG Approach

Another option we explored was generating synthetic data. Although this could have provided the necessary material, it raised significant concerns (bias, hallucinations, and data quality) that made the approach less attractive.

Ultimately, we chose a Retrieval-Augmented Generation (RAG) approach. Rather than fine-tuning on potentially limited or noisy data, RAG lets us map final answers directly to verified reference materials. This saves development time, since fine-tuning requires hand-creating and curating datasets, and it keeps our model’s outputs traceable to reliable sources.
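
To make the traceability point concrete, here is a minimal sketch of the retrieval step over a tiny in-memory corpus. It is an illustration under stated assumptions, not the pipeline itself: the `Passage` type, the `doc-a`/`doc-b` source labels, and the sample passages are hypothetical, and plain bag-of-words cosine similarity stands in for the embedding model and vector index a real setup would use.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # hypothetical label identifying the reference document
    text: str


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, passages: list[Passage], k: int = 1) -> list[Passage]:
    """Return the k passages most similar to the query, best first."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: cosine(q, tokenize(p.text)), reverse=True)
    return ranked[:k]


# Illustrative passages only; the source labels are placeholders.
corpus = [
    Passage("doc-a", "A tax credit reduces the tax you owe dollar for dollar."),
    Passage("doc-b", "Unemployment benefits are generally taxable income."),
]

top = retrieve("Are unemployment benefits taxable?", corpus)[0]
print(top.source, "->", top.text)
```

Because each retrieved passage carries its `source`, an answer built from it can cite where the supporting text came from.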

### Model Integration and Architecture

Most of the models we can run ourselves are relatively small or quantized, so bringing larger models into the workflow adds significant value. We use multiple model architectures to extract tax-related questions and verify their answers. This ensemble approach helps counteract individual model biases and gives a more robust foundation for the expert-level tax advisor model that we plan to fine-tune.
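
One way to picture the cross-model check is to look for questions that two models produced independently, since their answers can then be compared side by side. The sketch below is only an illustration: `similarity`, `shared_questions`, the `model_a`/`model_b` keys, and the 0.8 threshold are assumptions, and character-level matching from the standard library's `difflib` is a crude proxy for semantic similarity.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def shared_questions(by_model: dict[str, list[str]], threshold: float = 0.8):
    """Yield (model_1, model_2, question_1, question_2) for near-duplicate questions."""
    for (m1, qs1), (m2, qs2) in combinations(by_model.items(), 2):
        for q1 in qs1:
            for q2 in qs2:
                if similarity(q1, q2) >= threshold:
                    yield m1, m2, q1, q2


# Placeholder model names; the questions are taken from the responses in this notebook.
questions = {
    "model_a": ["Are unemployment benefits taxable?", "How do tax brackets work?"],
    "model_b": ["Are unemployment benefits taxable?", "What is the Saver's Credit?"],
}

matches = list(shared_questions(questions))
for m1, m2, q1, q2 in matches:
    print(f"{m1} and {m2} both asked: {q1}")
```

Questions flagged this way are natural candidates for comparing answers across models; questions that only one model produced need a separate check.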

### Tuning for Context and Bias Reduction

For tuning, we plan to focus on improving contextual awareness and reducing bias by drawing information from a variety of sources. As part of our quality assurance process, we reviewed a random sample of 32 tax-related questions for immediate, major hallucinations and found none.
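
A reproducible way to draw such a review sample might look like the sketch below. The `sample_for_review` helper, the fixed seed, and the placeholder question pool are assumptions for illustration; the source does not say how the 32 questions were actually selected.

```python
import random


def sample_for_review(items: list[str], n: int = 32, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample without replacement."""
    rng = random.Random(seed)  # a fixed seed lets reviewers re-create the same batch
    return rng.sample(items, min(n, len(items)))


# Placeholder question IDs; the real pool would be the collected questions.
pool = [f"question-{i}" for i in range(80)]
review_batch = sample_for_review(pool)
print(len(review_batch))
```

Seeding a private `random.Random` instance, rather than the module-level generator, keeps the sample stable even if other code in the notebook also draws random numbers.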



### Data Collection Responses

Below are the prompts we sent to each model during our evaluation process, each followed by the response we received.

#### GPT-4o (Online Search Feature):

Prompt to GPT-4o using the online search feature: Research online and give me the top 20 tax questions that most people ask, along with the answers to those questions.

##### Response:

```

Navigating the complexities of the U.S. tax system can be challenging. Below are 20 common tax questions along with concise answers to help clarify these topics:

1. **How can I reduce my tax bill?**
   - To lower your tax liability, consider maximizing deductions and credits, contributing to retirement accounts, and utilizing tax-efficient investments. Consult a tax professional for personalized strategies.

2. **What deductions am I eligible for?**
   - Eligibility for deductions varies based on individual circumstances. Common deductions include mortgage interest, state and local taxes, charitable contributions, and medical expenses exceeding a certain threshold.

3. **What's the difference between marginal and effective tax rates?**
   - The marginal tax rate is the rate applied to your next dollar of income, while the effective tax rate is the average rate you pay on your total taxable income.

4. **Which is better: a tax credit or a tax deduction?**
   - A tax credit directly reduces your tax liability dollar-for-dollar, whereas a deduction lowers your taxable income. Generally, credits are more beneficial.

5. **Can I deduct medical expenses?**
   - Yes, you can deduct unreimbursed medical expenses that exceed 7.5% of your adjusted gross income, provided you itemize your deductions.

6. **Should I itemize deductions or take the standard deduction?**
   - You should itemize if your allowable deductions exceed the standard deduction for your filing status. Otherwise, the standard deduction may be more beneficial.

7. **How do tax brackets work?**
   - The U.S. tax system is progressive; income is taxed at increasing rates as it moves into higher brackets. Each bracket applies to a specific range of income.

8. **What is the Earned Income Tax Credit (EITC) and how do I claim it?**
   - The EITC is a refundable credit for low to moderate-income earners. Eligibility depends on income and number of qualifying children. To claim, file a tax return even if you owe no tax.

9. **Are unemployment benefits taxable?**
   - Yes, unemployment benefits are generally considered taxable income at the federal level and may also be taxable at the state level.

10. **How are my investment gains taxed?**
    - Short-term capital gains (on assets held less than a year) are taxed as ordinary income, while long-term gains benefit from lower tax rates.

11. **Do I pay taxes on investments if I don't sell them?**
    - Generally, you pay taxes on dividends and interest earned, even if you don't sell the investment. Capital gains taxes apply only upon sale.

12. **What income is taxable?**
    - Taxable income includes wages, salaries, bonuses, interest, dividends, rental income, and other earnings. Some income, like certain Social Security benefits, may be partially taxable.

13. **When are taxes due?**
    - Federal income tax returns are typically due on April 15. If this date falls on a weekend or holiday, the deadline is the next business day.

14. **What happens if I don't file my taxes?**
    - Failure to file can result in penalties, interest on unpaid taxes, and potential legal action. It's advisable to file even if you can't pay immediately.

15. **How do I notify the IRS of an address change?**
    - You can inform the IRS by filing Form 8822, using your new address on your tax return, or sending a signed written statement to the address where you filed your last return.

16. **Can I claim my boyfriend/girlfriend as a dependent?**
    - Possibly, if they meet the criteria for a qualifying relative, including residency, income limits, and you providing over half of their support.

17. **Are moving expenses tax-deductible?**
    - As of the Tax Cuts and Jobs Act of 2017, moving expenses are generally not deductible, except for active-duty military members moving due to a military order.

18. **How do I check the status of my tax refund?**
    - You can check your refund status using the "Where's My Refund?" tool on the IRS website or via the IRS2Go mobile app.

19. **What is the Alternative Minimum Tax (AMT)?**
    - The AMT is a parallel tax system ensuring that high-income individuals pay a minimum amount of tax, disallowing certain deductions and credits.

20. **How long should I keep tax records?**
    - It's recommended to keep tax records for at least three years from the date you filed your return or two years from the date you paid the tax, whichever is later.
```
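
A response in this numbered-list layout can be turned into question/answer pairs with a small regular expression. The following is a minimal sketch that assumes each item is a bolded question line followed by a single `-` answer line, as in the response above; `QA_PATTERN` and `parse_numbered_qa` are hypothetical names, and the sample is a shortened excerpt.

```python
import re

# One numbered, bolded question line followed by one "- answer" line.
QA_PATTERN = re.compile(
    r"^\s*\d+\.\s+\*\*(?P<question>.+?)\*\*\s*\n\s*-\s+(?P<answer>.+)$",
    re.MULTILINE,
)


def parse_numbered_qa(text: str) -> list[dict[str, str]]:
    """Extract question/answer pairs from a numbered Markdown list."""
    return [
        {"question": m["question"].strip(), "answer": m["answer"].strip()}
        for m in QA_PATTERN.finditer(text)
    ]


# Shortened excerpt in the same layout as the response above.
sample = """\
1. **Are unemployment benefits taxable?**
   - Yes, unemployment benefits are generally considered taxable income at the federal level.

2. **When are taxes due?**
   - Federal income tax returns are typically due on April 15.
"""

pairs = parse_numbered_qa(sample)
for pair in pairs:
    print(pair["question"], "->", pair["answer"])
```

Items that span several answer lines would need a more tolerant pattern, so it is worth comparing the number of parsed pairs against the number of questions expected.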

#### ChatGPT o3-mini-high:  

Prompt to ChatGPT o3-mini-high: give me 20 more of the tax question. Some deep ones. search online or just if you know them

##### Response:

```
| Question | Answer |
|----------|--------|
| **What is the difference between tax avoidance and tax evasion?** | Tax avoidance is the legal use of tax laws, deductions, credits, and strategies to minimize tax liability. Tax evasion, on the other hand, is the illegal act of misrepresenting or concealing information to reduce taxes owed. |
| **How does the Alternative Minimum Tax (AMT) work for high-income taxpayers?** | The AMT is a parallel tax system designed to ensure that high-income taxpayers pay a minimum tax. It limits or disallows many deductions and credits available under the regular tax system, requiring taxpayers to calculate their liability under both methods and pay the higher amount. |
| **How do tax treaties mitigate double taxation on international income?** | Tax treaties are agreements between countries that allocate taxing rights on cross-border income. They help prevent the same income from being taxed twice by providing methods such as tax credits, exemptions, or reduced withholding rates on dividends, interest, and royalties. |
| **How is cost basis determined for capital gains, and what adjustments are commonly made?** | The cost basis is generally the original purchase price plus any associated acquisition costs. Adjustments may include depreciation, stock splits, reinvested dividends, or return of capital. The adjusted basis is used to compute the gain or loss when the asset is sold. |
| **What are the implications of the passive activity loss rules for rental real estate investments?** | Passive activity loss rules restrict the deduction of losses from rental and other passive activities against non-passive income. Losses that exceed the limit cannot be deducted in the current year but may be carried forward to offset future passive income or become deductible when the activity is sold. |
| **How are self-employment taxes calculated, and what strategies can reduce their impact?** | Self-employment taxes are calculated on net earnings at a combined rate of 15.3% (12.4% for Social Security and 2.9% for Medicare). Taxpayers can reduce the impact by deducting half of the self-employment tax as an adjustment to income and contributing to retirement plans that lower taxable income. |
| **What is the Qualified Business Income (QBI) deduction and what limitations apply?** | The QBI deduction allows eligible owners of pass-through entities to deduct up to 20% of their qualified business income. Limitations include income thresholds, restrictions for specified service trades or businesses, and additional requirements involving W-2 wages and the basis of qualified property. |
| **How does the tax treatment of cryptocurrency transactions differ from traditional investments?** | Cryptocurrencies are treated as property for tax purposes. This means that transactions such as sales or exchanges trigger capital gains or losses calculated based on the difference between the cost basis and the sale price. Taxpayers must keep detailed records, and special rules may apply for frequent trading or mining activities. |
| **What are the tax implications of stock options, and how do ISOs differ from NSOs?** | Incentive Stock Options (ISOs) can qualify for favorable tax treatment if holding requirements are met, potentially resulting in capital gains on sale. However, exercising ISOs may trigger the AMT. Non-Qualified Stock Options (NSOs) are taxed as ordinary income at the time of exercise, based on the difference between the exercise price and the fair market value. |
| **How does depreciation recapture affect the sale of depreciated property?** | When selling depreciated property, the portion of the gain attributable to depreciation deductions previously taken is "recaptured" and taxed as ordinary income rather than at the lower capital gains rate. This rule applies mainly to business or rental property and ensures that depreciation benefits are partially reversed. |
| **How does a Section 1031 like-kind exchange work and what are its limitations post-TCJA?** | A Section 1031 exchange allows taxpayers to defer capital gains tax on the sale of certain investment properties by reinvesting the proceeds in similar (like-kind) property. Post-TCJA, the exchange is limited to real property and must adhere to strict identification and timing rules to qualify for deferral. |
| **What are the tax implications of receiving foreign income and using the Foreign Tax Credit?** | U.S. taxpayers must report their worldwide income, including foreign income. The Foreign Tax Credit helps prevent double taxation by allowing a credit for taxes paid to a foreign government. The credit is limited to the proportion of U.S. tax attributable to foreign income, and any excess foreign taxes may be carried back or forward. |
| **How does the cancellation of debt (COD) income work and what is the insolvency exception?** | Generally, canceled debt is considered taxable income. However, if a taxpayer is insolvent—meaning their liabilities exceed their assets—they may exclude canceled debt from income up to the amount of insolvency. Proper documentation and filing of Form 982 are required to claim this exception. |
| **How do installment sales work for deferring gain on property sales?** | An installment sale allows the seller to spread the recognition of gain over the period in which the payments are received. Each installment includes a portion of the gain based on the gross profit ratio. This method can help spread out tax liability over several years, potentially lowering overall tax rates. |
| **What is the doctrine of economic substance and how does it affect tax planning?** | The economic substance doctrine requires that transactions have a substantial non-tax purpose beyond merely generating tax benefits. If a transaction is found to lack economic substance, the IRS may disregard it for tax purposes, which can lead to disallowed deductions or credits and potential penalties. |
| **How do tax loss carryforwards and carrybacks work, and what limitations exist?** | Tax loss carryforwards allow a taxpayer to apply a net operating loss (NOL) to offset taxable income in future years. Some losses may also be carried back to prior years for a refund. Limitations include percentage caps on how much of a loss can be used in a given year and expiration periods for carrying losses forward. |
| **What are the tax implications of converting a business entity, such as from an LLC to an S corporation?** | Converting a business entity can trigger tax consequences such as recognition of gain on appreciated assets, changes in self-employment tax treatment, and adjustments in the allocation of income and losses among owners. The conversion process must be planned carefully to minimize unintended tax liabilities. |
| **How does the treatment of charitable contributions differ when itemizing deductions versus taking the standard deduction?** | When itemizing, taxpayers can deduct qualified charitable contributions subject to AGI limits and documentation requirements. Those taking the standard deduction cannot separately deduct charitable contributions unless there is a special above-the-line provision available. |
| **How are fringe benefits taxed, and what exceptions exist?** | Fringe benefits are generally taxable to employees unless specifically excluded by law. Common exceptions include employer-provided health insurance, certain transportation benefits, and de minimis benefits (small perks with minimal value). The IRS provides specific valuation rules for various benefits. |
| **How does the tax treatment differ for nonresident aliens compared to U.S. residents?** | Nonresident aliens are typically taxed only on income that is effectively connected with a U.S. trade or business and on certain fixed, determinable, annual, or periodic (FDAP) U.S.-source income, often at a flat withholding rate. U.S. residents are taxed on their worldwide income and face different filing requirements and deductions. |
```
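
Table-style responses like the one above can be parsed into rows in a similar way. This sketch assumes a simple pipe-delimited Markdown table with one header row, one separator row, and no `|` characters inside cells; `parse_markdown_table` is a hypothetical helper and the sample row is a shortened excerpt.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited Markdown table into a list of row dicts."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [
        # strip("*") drops the bold markers wrapped around the questions
        {key: cell.strip("*").strip() for key, cell in zip(header, cells)}
        for cells in body
    ]


# Shortened excerpt in the same layout as the response above.
sample = """
| Question | Answer |
|----------|--------|
| **How are fringe benefits taxed, and what exceptions exist?** | Fringe benefits are generally taxable to employees unless specifically excluded by law. |
"""

table = parse_markdown_table(sample)
print(table[0]["Question"])
```

The resulting list of dicts can be passed straight to `pd.DataFrame(table)` if a DataFrame is more convenient for later steps.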



#### Claude 3.7 Sonnet: 


Prompt to Claude 3.7 Sonnet: give me 40 questions related to tax, not too unique ones and the answers to it in detail in a table


##### Response:

```
# 40 Common Tax Questions and Detailed Answers

| Question | Detailed Answer |
|----------|----------------|
| 1. What is the difference between a tax credit and a tax deduction? | A tax credit directly reduces your tax liability dollar-for-dollar. If you owe $1,000 in taxes and receive a $200 tax credit, your tax bill becomes $800. A tax deduction, on the other hand, reduces your taxable income. If you have a $1,000 deduction in the 22% tax bracket, it reduces your taxes by $220 (22% of $1,000). Credits are generally more valuable than deductions of the same amount. |
| 2. What is the standard deduction for 2024? | For the 2024 tax year (filing in 2025), the standard deduction is $14,600 for single filers and married filing separately, $29,200 for married filing jointly, and $21,900 for heads of household. These amounts are adjusted annually for inflation. |
| 3. What is the difference between itemized deductions and the standard deduction? | The standard deduction is a fixed amount that reduces your taxable income based on your filing status. Itemized deductions are specific expenses you can deduct, such as mortgage interest, charitable donations, medical expenses, and state and local taxes (up to certain limits). You should choose whichever method gives you the larger deduction. |
| 4. How long should I keep tax records? | The IRS recommends keeping tax returns and supporting documents for at least 3 years after filing, as this is generally the statute of limitations for audits. However, if you underreport income by more than 25%, the IRS has 6 years to audit you. For employment tax records, keep them for at least 4 years. For property records, keep them until you dispose of the property plus 3-7 years. |
| 5. What is the capital gains tax rate? | For 2024, long-term capital gains (assets held over one year) are taxed at 0% for incomes up to $47,025 (single) or $94,050 (married filing jointly), 15% for incomes up to $518,900 (single) or $583,750 (married filing jointly), and 20% for incomes above those thresholds. Short-term capital gains (assets held one year or less) are taxed as ordinary income. |
| 6. How do I calculate my tax bracket? | Tax brackets are based on your taxable income (after deductions and adjustments). For 2024, there are seven federal income tax brackets: 10%, 12%, 22%, 24%, 32%, 35%, and 37%. Your taxable income falls into specific brackets, with higher rates applying only to the portion of income that exceeds each threshold, creating a progressive tax system. |
| 7. What is the deadline for filing taxes? | For most individuals, federal income tax returns are due by April 15th each year. If April 15th falls on a weekend or holiday, the deadline shifts to the next business day. You can request a six-month extension using Form 4868, but this only extends the filing deadline—not the deadline for paying any taxes owed. |
| 8. What is the penalty for filing taxes late? | The penalty for filing late is typically 5% of the unpaid taxes for each month or part of a month that the return is late, up to 25%. If you file more than 60 days late, the minimum penalty is $485 (for returns due in 2024) or 100% of your unpaid tax, whichever is smaller. There's also a separate penalty for paying late: 0.5% of unpaid taxes per month, up to 25%. |
| 9. What tax forms do I need to file? | Most individuals use Form 1040 for federal income taxes. Additional schedules may be required depending on your situation: Schedule A for itemized deductions, Schedule B for interest and dividends, Schedule C for self-employment income, Schedule D for capital gains and losses, Schedule E for rental income, and Schedule SE for self-employment tax calculation. |
| 10. How do I check my tax refund status? | You can check your federal tax refund status using the "Where's My Refund?" tool on the IRS website or the IRS2Go mobile app. You'll need your Social Security number, filing status, and exact refund amount. Most refunds are issued within 21 days of electronic filing, though some may take longer due to errors, identity verification, or certain tax credits claimed. |
| 11. What is the Earned Income Tax Credit (EITC)? | The EITC is a refundable tax credit for low to moderate-income working individuals and couples, particularly those with children. For 2024, the maximum credit ranges from $600 (no qualifying children) to $7,430 (three or more qualifying children). Eligibility and credit amount depend on filing status, income, and number of qualifying children. |
| 12. Are unemployment benefits taxable? | Yes, unemployment benefits are generally considered taxable income at the federal level. You should receive Form 1099-G showing the amount of unemployment compensation you received during the tax year. However, state taxation of unemployment benefits varies by state. |
| 13. How do I report cryptocurrency transactions on my taxes? | Cryptocurrency transactions are reported on Schedule D and Form 8949. You must report any capital gains or losses from buying, selling, or exchanging cryptocurrency. Mining rewards and receiving crypto as payment for goods or services are typically reported as ordinary income. The IRS considers crypto as property for tax purposes, not currency. |
| 14. What is the Child Tax Credit? | The Child Tax Credit for 2024 is worth up to $2,000 per qualifying child under age 17, with up to $1,600 being refundable (as the Additional Child Tax Credit). The credit begins to phase out for single filers with modified AGI above $200,000 and married filing jointly above $400,000. |
| 15. What is the tax treatment for Health Savings Accounts (HSAs)? | HSAs offer triple tax advantages: contributions are tax-deductible, growth is tax-free, and withdrawals for qualified medical expenses are tax-free. For 2024, the contribution limit is $4,150 for individual coverage and $8,300 for family coverage, with an additional $1,000 catch-up contribution allowed for those 55 or older. |
| 16. How do 401(k) contributions affect my taxes? | Traditional 401(k) contributions are made pre-tax, reducing your taxable income for the year. For example, if you earn $60,000 and contribute $6,000 to your 401(k), your taxable income becomes $54,000. The 2024 contribution limit is $23,000, with an additional $7,500 catch-up contribution allowed for those 50 or older. Roth 401(k) contributions are made after-tax but grow tax-free. |
| 17. What home expenses are tax deductible? | For most homeowners, mortgage interest on up to $750,000 of acquisition debt (for homes purchased after Dec. 15, 2017) and property taxes (up to $10,000 combined with state and local income taxes) are deductible if you itemize. Home office expenses may be deductible for self-employed individuals using the exclusive area regularly for business. |
| 18. How are Social Security benefits taxed? | Social Security benefits may be partially taxable depending on your "combined income" (adjusted gross income + nontaxable interest + half of your Social Security benefits). For individuals with combined income between $25,000 and $34,000, up to 50% of benefits may be taxable. Above $34,000, up to 85% may be taxable. For married filing jointly, the thresholds are $32,000 and $44,000. |
| 19. What is the Gift Tax exclusion amount? | For 2024, you can give up to $18,000 per recipient per year without filing a gift tax return or counting against your lifetime estate and gift tax exemption. Married couples can split gifts, allowing up to $36,000 per recipient annually. Payments made directly to educational institutions or medical providers are exempt from gift tax regardless of amount. |
| 20. How do I handle taxes for a side gig or freelance work? | Income from side gigs or freelance work is reported on Schedule C. You're responsible for paying self-employment tax (15.3% for Social Security and Medicare) on net earnings over $400, reported on Schedule SE. You may need to make quarterly estimated tax payments using Form 1040-ES if you expect to owe $1,000 or more in taxes. Keep detailed records of business expenses for potential deductions. |
| 21. What education expenses are tax deductible or eligible for credits? | The American Opportunity Credit provides up to $2,500 per eligible student for the first four years of higher education. The Lifetime Learning Credit offers up to $2,000 per tax return for undergraduate, graduate, or professional courses. Student loan interest (up to $2,500) may be deductible. Qualified tuition and related expenses may be deductible under the Tuition and Fees Deduction. |
| 22. How does marriage affect taxes? | Marriage can create a "marriage bonus" or "marriage penalty" depending on your incomes. When spouses have similar incomes, especially if both are high earners, they may pay more tax filing jointly than they would as singles ("marriage penalty"). When incomes are disparate, joint filing often results in lower taxes ("marriage bonus"). Marriage affects tax brackets, standard deduction, and various credits and deductions. |
| 23. What is the Alternative Minimum Tax (AMT)? | The AMT is a parallel tax system designed to ensure that taxpayers with substantial income don't avoid paying taxes through deductions and credits. It has its own set of rules for calculating taxable income with fewer available deductions. You calculate tax under both regular and AMT systems and pay the higher amount. The AMT exemption for 2024 is $85,700 for singles and $133,300 for married filing jointly. |
| 24. How are dividends taxed? | Qualified dividends are taxed at the same preferential rates as long-term capital gains (0%, 15%, or 20%, depending on your income). Non-qualified dividends are taxed as ordinary income. Dividends from most U.S. corporations and certain foreign corporations are generally qualified if you've held the stock for more than 60 days during the 121-day period beginning 60 days before the ex-dividend date. |
| 25. What are the tax implications of working remotely in a different state? | Working remotely across state lines can create multi-state tax obligations. You may owe income tax to both your state of residence and the state where your employer is located, though tax credits typically prevent double taxation. Some states have "convenience of employer" rules that tax nonresident telecommuters. During COVID-19, some states implemented temporary relief provisions. |
| 26. What charitable donations are tax deductible? | Donations to qualified tax-exempt organizations (churches, nonprofit educational organizations, etc.) are deductible if you itemize. For cash donations, you can deduct up to 60% of your adjusted gross income (AGI). For property donations, the limit is typically 30% of AGI. Always get a receipt for donations of $250 or more, and special forms may be required for donations over $500. |
| 27. How do I claim a home office deduction? | Self-employed individuals can claim home office deductions using either the regular method (calculating actual expenses) or simplified method ($5 per square foot, up to 300 square feet). The space must be used regularly and exclusively for business. W-2 employees generally cannot claim home office deductions for tax years 2018-2025 due to the suspension of the miscellaneous itemized deductions. |
| 28. What is depreciation recapture? | Depreciation recapture occurs when you sell a depreciable asset (like rental property) for more than its depreciated value. The portion of your gain attributable to depreciation is taxed as ordinary income, up to 25%, rather than at capital gains rates. This prevents taxpayers from claiming ordinary deductions while receiving the benefit of lower capital gains rates on sale. |
| 29. How do I report rental income and expenses? | Rental income and expenses are reported on Schedule E. Deductible expenses include mortgage interest, property taxes, insurance, maintenance, utilities, management fees, and depreciation of the property. Rental losses may be deductible up to $25,000 against other income if your modified AGI is less than $100,000 and you actively participate in the rental activity. |
| 30. What is the Net Investment Income Tax (NIIT)? | The NIIT is a 3.8% tax on net investment income (interest, dividends, capital gains, rental income, royalties, etc.) that applies to individuals with modified adjusted gross income exceeding $200,000 (single) or $250,000 (married filing jointly). It's calculated on the lesser of your net investment income or the amount by which your MAGI exceeds the threshold. |
| 31. How are retirement account withdrawals taxed? | Traditional IRA and 401(k) withdrawals are generally taxed as ordinary income. Roth IRA qualified distributions (after age 59½ and account open for 5+ years) are tax-free. Early withdrawals (before age 59½) may incur a 10% penalty in addition to regular income tax, though exceptions exist. Required Minimum Distributions (RMDs) must begin at age 73 (as of 2024) for traditional accounts. |
| 32. What is the tax treatment for stock options? | For Non-Qualified Stock Options (NQSOs), you pay ordinary income tax on the difference between the grant price and exercise price when you exercise the option. For Incentive Stock Options (ISOs), you generally don't pay tax upon exercise, but the difference may trigger Alternative Minimum Tax. When you sell the stock, you'll pay capital gains tax on any additional appreciation. |
| 33. How do I deduct medical expenses? | Medical expenses exceeding 7.5% of your adjusted gross income are deductible if you itemize. Eligible expenses include costs for diagnosis, treatment, prevention, insurance premiums (if not pre-tax), prescription drugs, and certain long-term care expenses. Cosmetic procedures, non-prescription drugs, and general health products typically don't qualify. |
| 34. What is the Saver's Credit? | The Saver's Credit (officially the Retirement Savings Contributions Credit) provides a tax credit of up to $1,000 ($2,000 for married filing jointly) for low and moderate-income taxpayers who contribute to retirement accounts. For 2024, the credit phases out for single filers with AGI above $24,000 and married filing jointly above $48,000. The credit amount is 10%, 20%, or 50% of contributions, depending on income. |
| 35. How do I report stock sales on my tax return? | Stock sales are reported on Schedule D and Form 8949. Your broker should provide Form 1099-B with details of your transactions. You'll need to report the sale date, purchase date, cost basis, sale proceeds, and gain or loss. Holding periods determine whether gains are short-term (ordinary income rates) or long-term (preferential capital gains rates). |
| 36. What is the Foreign Tax Credit? | The Foreign Tax Credit helps U.S. taxpayers avoid double taxation on foreign-source income by providing a credit for foreign taxes paid. You can either take a deduction for foreign taxes paid (on Schedule A) or a credit (using Form 1116, though Form 1040 filers with only passive foreign income may qualify for a simplified procedure). The credit is generally more beneficial than the deduction. |
| 37. What are the tax implications of cancellation of debt? | Cancelled debt is generally considered taxable income. If a creditor forgives or cancels $600 or more of debt, they'll issue Form 1099-C. However, exceptions exist for certain bankruptcy cases, insolvency, qualified principal residence indebtedness, qualified farm debt, and certain student loans. The tax treatment depends on your specific circumstances and the type of debt. |
| 38. How do I report gambling winnings and losses? | All gambling winnings are reportable as "Other Income" on Form 1040. Casinos and other payers may provide Form W-2G for certain winnings. Gambling losses are deductible only if you itemize deductions and only up to the amount of your winnings. Keep detailed records of both winnings and losses, including dates, locations, types of gambling, and amounts won or lost. |
| 39. What is the tax treatment for Health Flexible Spending Arrangements (FSAs)? | Health FSA contributions are made pre-tax through employer plans, reducing your taxable income. For 2024, the contribution limit is $3,200. Unlike HSAs, FSAs generally follow a "use it or lose it" rule, though employers may offer either a grace period of up to 2.5 months or allow carrying over up to $640 to the next year. Withdrawals for qualified medical expenses are tax-free. |
| 40. How are inherited assets taxed? | Inherited assets generally receive a "step-up" in basis to their fair market value on the date of the original owner's death, eliminating capital gains tax on appreciation that occurred during the deceased's lifetime. When you eventually sell the asset, you'll only pay capital gains tax on appreciation since inheritance. Some inherited retirement accounts have specific required distribution rules that affect taxation. |
```


#### Llama 3.1-405B:


Prompt to Llama 3.1-405B: give me 40 questions related to tax, not too unique ones and the answers to it in detail in a table, be detailed

##### Response:

```
Here are the 40 tax-related questions with detailed answers:
Tax-Related Questions and Answers

| Question # | Question | Answer |
| 1 | What is the deadline for filing individual tax returns? | The deadline for filing individual tax returns is April 15th of each year, unless an extension is filed using Form 4868, which grants an automatic six-month extension to October 15th. |
| 2 | What is the standard deduction for single filers? | The standard deduction for single filers is $13,850 for tax year 2023. This amount is adjusted annually for inflation. |
| 3 | What is the standard deduction for married couples filing jointly? | The standard deduction for married couples filing jointly is $27,700 for tax year 2023. This amount is also adjusted annually for inflation. |
| 4 | What is the tax rate for long-term capital gains? | Long-term capital gains are taxed at 0%, 15%, or 20% depending on income level. For tax year 2023, the 0% rate applies to taxable income up to $44,625 for single filers and $89,250 for joint filers. The 15% rate applies to taxable income between $44,626 and $492,300 for single filers and $89,251 and $553,850 for joint filers. The 20% rate applies to taxable income above $492,300 for single filers and $553,850 for joint filers. |
| 5 | Can I deduct mortgage interest on my tax return? | Yes, you can deduct mortgage interest on your primary residence and/or second home, up to a total of $750,000 ($375,000 for married filing separately). You must itemize deductions on Schedule A to claim this deduction. |
| 6 | What is the earned income tax credit (EITC)? | The EITC is a refundable tax credit for low-income working individuals and families. For tax year 2023, the credit ranges from $538 to $6,728 depending on income level, filing status, and number of qualifying children. |
| 7 | Can I claim a dependent on my tax return? | Yes, you can claim a dependent on your tax return if they meet certain tests, such as the qualifying child or qualifying relative test. You may be eligible for the dependent exemption, child tax credit, or other benefits. |
| 8 | What is the child tax credit? | The child tax credit is a non-refundable tax credit of up to $2,000 per child under age 17. The credit begins to phase out at $400,000 for joint filers and $200,000 for single filers. |
| 9 | Do I need to file a tax return if I'm self-employed? | Yes, you need to file a tax return if your net earnings from self-employment are $400 or more. You'll report your business income and expenses on Schedule C and pay self-employment tax on your net earnings. |
| 10 | What is the self-employment tax rate? | The self-employment tax rate is 15.3% of net earnings from self-employment. This includes 12.4% for Social Security and 2.9% for Medicare. You'll report self-employment tax on Schedule SE. |
| 11 | Can I deduct business expenses on my tax return? | Yes, you can deduct business expenses on your tax return if you're self-employed or have a side hustle. You'll report business expenses on Schedule C and may need to complete additional forms, such as Form 8829 for the home office deduction. |
| 12 | What is the home office deduction? | The home office deduction allows you to deduct a portion of your rent or mortgage interest and utilities if you use a dedicated space for business. You can use the simplified option, which allows a deduction of $5 per square foot of home office space, up to a maximum of $1,500. |
| 13 | Can I deduct charitable donations on my tax return? | Yes, you can deduct charitable donations on your tax return if you itemize deductions on Schedule A. You'll need to keep receipts or bank records to substantiate your donations. |
| 14 | What is the limit on state and local tax (SALT) deductions? | The limit on SALT deductions is $10,000 ($5,000 for married filing separately). This includes state income taxes, local property taxes, and other taxes. |
| 15 | Can I deduct medical expenses on my tax return? | Yes, you can deduct medical expenses on your tax return if you itemize deductions on Schedule A. You can deduct expenses that exceed 10% of your adjusted gross income (AGI). |
| 16 | What is the alternative minimum tax (AMT)? | The AMT is a separate tax calculation that ensures a minimum amount of tax is paid. You'll complete Form 6251 to determine if you owe AMT. |
| 17 | Can I claim a deduction for education expenses? | Yes, you can claim a deduction for education expenses if you're pursuing higher education and meet certain requirements. You may be eligible for the American Opportunity Tax Credit or the Lifetime Learning Credit. |
| 18 | What is the student loan interest deduction? | The student loan interest deduction allows you to deduct up to $2,500 of interest paid on student loans. This deduction is subject to income limits and is claimed on Form 1040. |
| 19 | Can I deduct moving expenses on my tax return? | Yes, you can deduct moving expenses on your tax return if you're moving for a job or business and meet certain requirements. You'll complete Form 3903 to claim this deduction. |
| 20 | What is the tax rate for short-term capital gains? | Short-term capital gains are taxed as ordinary income, using the same tax rates as your regular income. |
| 21 | Can I deduct property taxes on my tax return? | Yes, you can deduct property taxes on your primary residence and/or second home, up to a total of $10,000 ($5,000 for married filing separately). You must itemize deductions on Schedule A to claim this deduction. |
| 22 | What is the qualified business income (QBI) deduction? | The QBI deduction allows you to deduct up to 20% of qualified business income from a partnership, S corporation, or sole proprietorship. This deduction is subject to income limits and is claimed on Form 1040. |
| 23 | Can I deduct retirement contributions on my tax return? | Yes, you can deduct contributions to a traditional IRA or 401(k) on your tax return. These contributions may be subject to income limits and are claimed on Form 1040. |
| 24 | What is the required minimum distribution (RMD) age? | The RMD age is 72 years old, unless you're still working for an employer sponsoring a retirement plan. You'll need to take RMDs from traditional IRAs and 401(k)s. |
| 25 | Can I deduct alimony payments on my tax return? | Yes, you can deduct alimony payments on your tax return if you're the payer and meet certain requirements. This deduction is claimed on Form 1040. |
| 26 | What is the tax rate for dividends? | Dividends are taxed at 0%, 15%, or 20% depending on income level. For tax year 2023, the 0% rate applies to taxable income up to $44,625 for single filers and $89,250 for joint filers. The 15% rate applies to taxable income between $44,626 and $492,300 for single filers and $89,251 and $553,850 for joint filers. The 20% rate applies to taxable income above $492,300 for single filers and $553,850 for joint filers. |
| 27 | Can I deduct business use of my car on my tax return? | Yes, you can deduct business use of your car on your tax return if you're self-employed or have a side hustle. You can use the standard mileage rate or actual expenses. |
| 28 | What is the standard mileage rate for business use? | The standard mileage rate for business use is 58.5 cents per mile for tax year 2023. You can also deduct tolls and parking fees. |
| 29 | Can I deduct home improvements on my tax return? | Yes, you can deduct home improvements on your tax return if they're medically necessary or increase the value of your home. You'll need to keep receipts and records to substantiate your expenses. |
| 30 | What is the tax credit for child and dependent care? | The tax credit for child and dependent care is a non-refundable tax credit of up to $3,000 for one child or $6,000 for two or more. This credit is claimed on Form 2441. |
| 31 | Can I deduct state taxes on my tax return? | Yes, you can deduct state taxes on your tax return if you itemize deductions on Schedule A. This includes state income taxes, local property taxes, and other taxes. |
| 32 | What is the tax rate for interest income? | Interest income is taxed as ordinary income, using the same tax rates as your regular income. |
| 33 | Can I deduct investment expenses on my tax return? | Yes, you can deduct investment expenses on your tax return if you itemize deductions on Schedule A. This includes fees for investment advice, management, and other expenses. |
| 34 | What is the tax credit for education expenses? | The tax credit for education expenses is a non-refundable credit of up to $2,500 for qualified expenses. It includes the American Opportunity Tax Credit (AOTC), which covers 100% of the first $2,000 and 25% of the next $2,000, and the Lifetime Learning Credit (LLC), which covers 20% of up to $10,000. Income limits apply, and you can claim only one credit per student per year. To claim the credit, complete Form 8863. |
| 35 | Can I deduct business travel expenses on my tax return? | Yes, you can deduct business travel expenses on your tax return if you're self-employed or have a side hustle. You'll need to keep receipts and records to substantiate your expenses. |
| 36 | What is the per diem rate for business travel? | The per diem rate for business travel varies by location, but $60 per day for meals and incidentals is a common rate. You can use the IRS's per diem rates or actual expenses. |
| 37 | Can I deduct business entertainment expenses on my tax return? | Yes, you can deduct business entertainment expenses on your tax return if you're self-employed or have a side hustle. However, only 50% of these expenses are deductible. |
| 38 | What is the tax rate for capital gains on collectibles? | Capital gains on collectibles, such as art, coins, or stamps, are taxed at a rate of 28%. |
| 39 | Can I deduct business liability insurance premiums on my tax return? | Yes, you can deduct business liability insurance premiums on your tax return if you're self-employed or have a side hustle. This includes premiums for professional liability, business liability, and other types of insurance. |
| 40 | What is the tax credit for retirement savings contributions? | The tax credit for retirement savings contributions is a non-refundable tax credit of up to $2,000 for single filers and $4,000 for joint filers. This credit is claimed on Form 8880. |


Important Notes
These answers are based on federal tax laws and regulations for the 2023 tax year.
Tax laws and regulations are subject to change, so it's always best to consult with a tax professional or the IRS for the most up-to-date information.
These questions and answers are meant to provide general guidance and may not apply to your specific situation.
Always consult with a tax professional or financial advisor to ensure you're in compliance with tax laws and regulations.
Additional Resources
IRS Website: (link unavailable)
IRS Taxpayer Assistance Centers: (link unavailable)
Tax Professionals: Find a certified public accountant (CPA) or enrolled agent (EA) in your area.
Tax Software: Utilize tax software, such as TurboTax or H&R Block, to prepare and file your tax return.
```




All of this is then cleaned and compiled into a Google Doc, which is then converted to a CSV file called `tax-questions-answers.csv`.


In [None]:
import pandas as pd

q_and_a = pd.read_csv(os.path.join(DATA_DIR,'tax-questions-answers.csv'))
q_and_a.head()

Unnamed: 0,Source,ID,Question,Answer
0,ChatGPT GPT4o,1,How can I reduce my tax bill?,"To lower your tax liability, consider maximizi..."
1,ChatGPT GPT4o,2,What deductions am I eligible for?,Eligibility for deductions varies based on ind...
2,ChatGPT GPT4o,3,What's the difference between marginal and eff...,The marginal tax rate is the rate applied to y...
3,ChatGPT GPT4o,4,Which is better: a tax credit or a tax deduction?,A tax credit directly reduces your tax liabili...
4,ChatGPT GPT4o,5,Can I deduct medical expenses?,"Yes, you can deduct unreimbursed medical expen..."


In [15]:
q_and_a['Source'].unique()

array(['ChatGPT GPT4o', 'ChatGPT o3-mini-high', 'Claude 3.7 Sonnet ',
       'Llama 3.1-405B'], dtype=object)

Since we will be using RAG with the tax documentation, we will pull the PDFs directly and process them.

### RAG

#### Reading PDFs from IRS

In [None]:


import fitz  # PyMuPDF, used below to extract text from each PDF

IRS_URL = "https://www.irs.gov/instructions"

try:
    response = requests.get(IRS_URL, timeout=10)
    response.raise_for_status() 
except requests.RequestException as e:
    # Chain the original error so the traceback keeps the underlying cause.
    raise RuntimeError(f"Failed to fetch IRS instructions page: {e}") from e

soup = BeautifulSoup(response.text, "html.parser")
pdf_texts = []

for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.lower().endswith(".pdf"):
        pdf_url = urljoin(IRS_URL, href)
        print(f"Processing PDF: {pdf_url}")

        try:
            pdf_response = requests.get(pdf_url, timeout=10)
            pdf_response.raise_for_status()
        except requests.RequestException as e:
            print(f"❌ Failed to fetch {pdf_url}: {e}")
            continue

        pdf_bytes = BytesIO(pdf_response.content)
        try:
            # Using PyMuPDF (fitz) to extract text as an alternative to pdfminer
            doc = fitz.open(stream=pdf_bytes, filetype="pdf")
            text = ""
            for page in doc:
                text += page.get_text()
            pdf_texts.append({"file_name": pdf_url.split("/")[-1], "content": text})
        except Exception as e:
            print(f"❌ Failed to process {pdf_url}: {e}")

print(f"✅ Processed {len(pdf_texts)} PDFs")


Processing PDF: https://www.irs.gov/pub/irs-pdf/i1040gi.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/pcir230.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i1040gi.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/pcir230.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i56.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i172.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i461.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i706.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i706a.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i706d.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i706gsd.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i706gsd1.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i706gst.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i706na.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i706qdt.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i709.pdf
Processing PDF: https://www.irs.gov/pub/irs-pdf/i720.pdf
Pro
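The log above shows `i1040gi.pdf` and `pcir230.pdf` each being fetched twice, because the page links to them more than once. A minimal sketch of de-duplicating the links before downloading is given here; `unique_pdf_urls` is a hypothetical helper that is not part of the original notebook, and the sample `hrefs` are illustrative.

```python
from urllib.parse import urljoin


def unique_pdf_urls(base_url, hrefs):
    """Resolve hrefs against base_url and keep each PDF URL once, in first-seen order."""
    seen = set()
    urls = []
    for href in hrefs:
        # Skip anything that is not a PDF link.
        if not href.lower().endswith(".pdf"):
            continue
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Illustrative hrefs, including one repeated link and one non-PDF link.
hrefs = [
    "/pub/irs-pdf/i1040gi.pdf",
    "/pub/irs-pdf/pcir230.pdf",
    "/pub/irs-pdf/i1040gi.pdf",
    "/forms-instructions",
]
pdf_urls = unique_pdf_urls("https://www.irs.gov/instructions", hrefs)
print(pdf_urls)
```

Iterating over such a list instead of over every `<a>` tag would avoid downloading and indexing the same document twice.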

### Converting to MD

In [None]:
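# Assumption: `md` is markdownify's converter (its import is not shown in this notebook).
from markdownify import markdownify as md

pdf_markdown = [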
    {"file_name": doc["file_name"].replace(".pdf", ".md"), "content": md(doc["content"])}
    for doc in pdf_texts
]

print(f"✅ Converted {len(pdf_markdown)} PDFs to Markdown format")

✅ Converted 409 PDFs to Markdown format


### Chunking & Storing Vectors

In [None]:

# Initialize chunking
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Chunk the Markdown documents
chunks = []
for doc in pdf_markdown:
    split_texts = text_splitter.split_text(doc["content"])
    for i, text in enumerate(split_texts):
        chunks.append({"file_name": doc["file_name"], "chunk_id": i, "content": text})

print(f"✅ Created {len(chunks)} text chunks")

✅ Created 68418 text chunks
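To make the roles of `chunk_size=500` and `chunk_overlap=100` concrete, here is a simplified sketch of overlapping character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 1,200 characters of repeating letters as stand-in document text.
sample = "".join(chr(97 + i % 26) for i in range(1200))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])
```

Each window shares its last 100 characters with the start of the next one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.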


### Pushing to Pinecone

In [None]:

# Initialize Pinecone
pc = pinecone.Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Set index name
index_name = "tax-rag"

# Check if the index already exists
existing_indexes = [index.name for index in pc.list_indexes()]
if index_name not in existing_indexes:
    print(f"Creating new Pinecone index: {index_name}")
    pc.create_index(
        name=index_name,
        spec=pinecone.ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        ),
        dimension=384,
        metric="cosine"
    )

# Connect to the existing index
index = pc.Index(index_name)
print(f"✅ Connected to Pinecone index: {index_name}")

Creating new Pinecone index: tax-rag
✅ Connected to Pinecone index: tax-rag


In [10]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:

# Convert each chunk into an embedding
chunk_embeddings = [
    {
        "file_name": chunk["file_name"],
        "chunk_id": chunk["chunk_id"],
        "content": chunk["content"],
        "embedding": embedding_model.encode(chunk["content"]).tolist()
    }
    for chunk in chunks
]

# Store in Pinecone with batching
def batch_upsert(index, vectors, batch_size=100):
    """Uploads embeddings in batches to avoid Pinecone's request size limit."""
    # Ceiling division, so an exact multiple of batch_size is not over-counted.
    total_batches = (len(vectors) + batch_size - 1) // batch_size
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i : i + batch_size]  # Get batch slice
        index.upsert(batch)
        print(f"✅ Uploaded batch {i // batch_size + 1}/{total_batches}")

# Prepare vectors for Pinecone
vectors = [
    (f"{chunk['file_name']}_{chunk['chunk_id']}", chunk["embedding"], {"text": chunk["content"]})
    for chunk in chunk_embeddings
]

# Upload in batches
batch_upsert(index, vectors)

print("✅ All embeddings successfully stored in Pinecone!")

The cell executed successfully, but we accidentally deleted the output during the cleanup process. The complete successful run took two hours.

### Testing Embedding

In [20]:
def query_and_get_top_two(query, embedding_model, index):
    query_embedding = embedding_model.encode(query).tolist()
    results = index.query(vector=query_embedding, top_k=2, include_metadata=True)
    matches = results.get("matches", [])
    if len(matches) == 0:
        return ("", None, "", None)
    elif len(matches) == 1:
        return (matches[0]['metadata']['text'], matches[0]['score'], "", None)
    else:
        return (matches[0]['metadata']['text'], matches[0]['score'], 
                matches[1]['metadata']['text'], matches[1]['score'])

In [26]:
context1, score1, context2, score2 = query_and_get_top_two("Which forms do I need for self-employment?", embedding_model, index)


print("Top 1 Result:")
print("Text :", context1)
print("Score:", score1)

if context2:
    print("\nTop 2 Result:")
    print("Text :", context2)
    print("Score:", score2)


Top 1 Result:
Text : self-employment income from separate nonfarm or farm
businesses, each of you must complete and file a
separate Schedule C (Form 1040) or Schedule F (Form
1040). Be sure to enter at the top of each Schedule C
(Form 1040) or Schedule F (Form 1040) the name and
SSN of the spouse who owns the business. Each of you
must also complete a separate Schedule SE (Form 1040).
Attach these pages to a single Form 1040-SS.
Business Owned and Operated by
Spouses
Score: 0.604735672

Top 2 Result:
Text : Schedule SE (Form 1040), Self-Employment Tax, to complete your return.
You may only need to file Form 1040-SS and none of the schedules. However, if your return is more complicated (for
example, you claim certain deductions or credits or owe additional taxes), you will need to complete one or more of the
schedules. Below is a general guide to which schedule(s) you will need to file based on your circumstances. See the
Score: 0.598732054
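Because the index was created with `metric="cosine"`, the scores above are cosine similarities between the query embedding and each chunk embedding. The sketch below shows the computation on small hand-made vectors; the vectors are illustrative and far shorter than the 384-dimensional embeddings used in the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Define the similarity with a zero vector as 0 to avoid division by zero.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different length
orthogonal = [0.0, 3.0, 0.0]       # no shared direction

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Vectors pointing the same way score close to 1 regardless of length, and unrelated directions score close to 0, so scores around 0.6 indicate moderately related text rather than a near-exact match.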


### Updating the Fine-tuning Data

In [27]:
q_and_a[['Context1', 'Score1', 'Context2', 'Score2']] = q_and_a['Question'].apply(
    lambda q: pd.Series(query_and_get_top_two(q, embedding_model, index))
)

In [None]:
q_and_a.head()

Unnamed: 0,Source,ID,Question,Answer,Context1,Score1,Context2,Score2
0,ChatGPT GPT4o,1,How can I reduce my tax bill?,"To lower your tax liability, consider maximizi...","System at IRS.gov/SAMS.\nFor more information,...",0.566803,WV 26106-2188. Or you can enclose\nthe check w...,0.563434
1,ChatGPT GPT4o,2,What deductions am I eligible for?,Eligibility for deductions varies based on ind...,Standard Deduction (Group I Only)\nIf you do n...,0.668105,33\nStandard Deduction Worksheet for Dependent...,0.662116
2,ChatGPT GPT4o,3,What's the difference between marginal and eff...,The marginal tax rate is the rate applied to y...,"a rate for branch profits, the rate of tax is ...",0.553689,Taxable amount over\nColumn B\nTaxable amount ...,0.487213
3,ChatGPT GPT4o,4,Which is better: a tax credit or a tax deduction?,A tax credit directly reduces your tax liabili...,"deduction, or both. However, a practice that d...",0.579091,tax deduction recognized for U.S. taxable inco...,0.547606
4,ChatGPT GPT4o,5,Can I deduct medical expenses?,"Yes, you can deduct unreimbursed medical expen...","If you itemize, you can deduct a part of your ...",0.688144,Don't include on Schedule A items deducted els...,0.681987


`Breakpoint`

Run the following cells to either save or load the data at this point.


In [3]:
import os
import pandas as pd
DATA_DIR = "data"
q_and_a_path = os.path.join(DATA_DIR, "tax-questions-answers-with-context.csv")

In [32]:
q_and_a.to_csv(q_and_a_path, index=False)

In [4]:
q_and_a = pd.read_csv(q_and_a_path)

## Model Fine-tuning

Our first step was to look for a model that was already close to our needs, so that extensive fine-tuning would not be required.

We found two main candidates:

- Lawma 
- saul_7b_instruct


Lawma is available in 8B and 70B variants. TODO:


We were unable to run these models satisfactorily on our hardware, so we looked for an alternative. DeepSeek had just been released at the time, and we tested several DeepSeek variants. The best fit was "DeepSeek-V2-Lite", which is lightweight, instruction-tuned, and ran well within our compute limits.


We obtained a server with two A100 GPUs.


The problems with the other models were not resolved even by fine-tuning, and the overall pipeline did not work well with them.


So we decided to fine-tune DeepSeek on the data we had.


In [7]:
pip install --upgrade bitsandbytes transformers peft accelerate datasets trl

Looking in indexes: https://pypi.netflix.net/simple
Collecting bitsandbytes
  Using cached https://pypi.netflix.net/packages/18487298685/bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting transformers
  Using cached https://pypi.netflix.net/packages/18469686031/transformers-4.49.0-py3-none-any.whl (10.0 MB)
Collecting peft
  Using cached https://pypi.netflix.net/packages/17715809385/peft-0.14.0-py3-none-any.whl.metadata (13 kB)
Collecting accelerate
  Using cached https://pypi.netflix.net/packages/18469816810/accelerate-1.4.0-py3-none-any.whl (342 kB)
Collecting datasets
  Using cached https://pypi.netflix.net/packages/18478228733/datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting trl
  Using cached https://pypi.netflix.net/packages/18490342842/trl-0.15.2-py3-none-any.whl.metadata (11 kB)
Collecting huggingface-hub<1.0,>=0.26.0 (from transformers)
  Using cached https://pypi.netflix.net/packages/18477366081/huggingface_hub-0.29.1-py3-none-any.w

In [9]:
import os
# Set CUDA device environment variables BEFORE importing any CUDA libraries.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
torch.cuda.set_device(0)
print("Using GPU:", torch.cuda.current_device())

import pandas as pd
import warnings
warnings.filterwarnings('ignore')


Using GPU: 0


In [6]:
# -------------------------
# Data Preparation
# -------------------------
DATA_DIR = "data"
q_and_a_path = os.path.join(DATA_DIR, "tax-questions-answers-with-context.csv")

try:
    q_and_a
except NameError:
    q_and_a = pd.read_csv(q_and_a_path)

def construct_prompt(row):
    return f"""You are a helpful tax advisor and legal expert. Use the provided context to answer the user's query in a clear and concise manner.

User Query: {row['Question']}

Related Context:
{row['Context1']}
{row['Context2']}

Note: The above information is extracted from relevant forms or online sources. Use it to formulate your response."""
    
q_and_a["prompt"] = q_and_a.apply(construct_prompt, axis=1)
q_and_a["target"] = q_and_a["Answer"]

print("Data prepared: prompt and target columns added.")

from datasets import Dataset

# Convert the DataFrame to a Hugging Face dataset and perform a train/test split.
dataset = Dataset.from_pandas(q_and_a)
dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

print("Train and evaluation datasets created.")


Data prepared: prompt and target columns added.
Train and evaluation datasets created.
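One detail to watch in the prompt template: `query_and_get_top_two` returns an empty string when a match is missing, and after the CSV round trip at the breakpoint pandas reads empty fields back as `NaN`, which an f-string renders as the literal text `nan`. The sketch below shows one way to guard against that. `clean_context` and `join_contexts` are hypothetical helpers, not part of the original notebook, and the sample row is illustrative.

```python
import math


def clean_context(value):
    """Return value as text, mapping None and float NaN to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


def join_contexts(*contexts):
    """Join the non-empty contexts with newlines."""
    cleaned = [clean_context(c).strip() for c in contexts]
    return "\n".join(c for c in cleaned if c)


# Illustrative row where the second context is missing after reloading the CSV.
row = {
    "Context1": "Schedule SE (Form 1040), Self-Employment Tax",
    "Context2": float("nan"),
}
context_block = join_contexts(row["Context1"], row["Context2"])
print(context_block)
```

The result could be interpolated under "Related Context:" in place of the two separate context fields.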


In [10]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig

# Define model paths.
base_model_path = "./model_directory/models--deepseek-ai--DeepSeek-V2-Lite-Chat/snapshots/85864749cd611b4353ce1decdb286193298f64c7"
output_dir = "./model_directory/models--zainnobody--TaxSense/"
os.makedirs(output_dir, exist_ok=True)

# Load the tokenizer from the base model.
tokenizer = AutoTokenizer.from_pretrained(
    base_model_path,
    add_eos_token=True,
    use_fast=True,
    trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

# Set compute dtype and attention implementation.
if torch.cuda.is_bf16_supported():
    compute_dtype = torch.bfloat16
    attn_implementation = "flash_attention_2"
else:
    compute_dtype = torch.float16
    attn_implementation = "sdpa"

# Configure 4-bit quantization settings.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

# Load the base model using a device map that forces it onto the current CUDA device.
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    quantization_config=bnb_config,
    device_map={"": torch.cuda.current_device()},
    trust_remote_code=True,
)
print("Base model loaded.")

# Prepare the model for k-bit training (QLoRA).
model = prepare_model_for_kbit_training(model)

# Optionally disable gradient checkpointing if it's causing issues.
if hasattr(model, "gradient_checkpointing_disable"):
    model.gradient_checkpointing_disable()
    print("Gradient checkpointing disabled.")

# Ensure the model is on GPU 0.
model = model.to("cuda:0")
print("Model is on GPU 0.")

# Configure LoRA.
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Base model loaded.
Gradient checkpointing disabled.
Model is on GPU 0.
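For intuition about the `LoraConfig` values: LoRA leaves each adapted linear layer frozen and trains two small matrices of rank `r` beside it, so one `d_in -> d_out` layer contributes `r * (d_in + d_out)` trainable parameters, and the adapter output is scaled by `lora_alpha / r`. The sketch below works this out for a single illustrative layer shape (2048 by 10944, taken from `hidden_size` and `intermediate_size` in the model config printed in the training log). It does not attempt to reproduce the model's total trainable-parameter count.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer (no bias)."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r


r, lora_alpha = 16, 16                 # values from the LoraConfig above
d_in, d_out = 2048, 10944              # illustrative layer shape

added = lora_param_count(d_in, d_out, r)
full = d_in * d_out                    # parameters in the frozen weight matrix
scaling = lora_alpha / r               # factor applied to the adapter output

print(f"LoRA adds {added:,} parameters next to {full:,} frozen ones "
      f"({added / full:.2%}); scaling = {scaling}")
```

With `target_modules="all-linear"`, this per-layer cost is paid for every linear layer in the model, which is why the total still reaches hundreds of millions of parameters.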


In [12]:
# Calculate token counts for each prompt in the DataFrame
q_and_a["token_count"] = q_and_a["prompt"].apply(lambda x: len(tokenizer.encode(x)))

# Determine the maximum token count from your dataset
max_token_len = q_and_a["token_count"].max()

if max_token_len > 512:
    num_exceeding = (q_and_a["token_count"] > 512).sum()
    print(f"Due to CUDA memory limits, we can only go up to 512 tokens. We will truncate {num_exceeding} samples.")
    max_token_len = 512



Due to CUDA memory limits, we can only go up to 512 tokens. We will truncate 4 samples.


In [13]:
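# Builds the full training text (prompt + response) and tokenizes it.
# Note: this function is only defined in this cell. It takes effect once it is mapped
# over the datasets, e.g. train_dataset.map(tokenize_function, batched=True).
def tokenize_function(examples):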
    full_texts = [
        f"{prompt.strip()}\n\nResponse: {target.strip()}"
        for prompt, target in zip(examples["prompt"], examples["target"])
    ]
    return tokenizer(full_texts, truncation=True, max_length=max_token_len, padding="max_length")


In [5]:
from transformers import TrainingArguments
from trl import SFTTrainer

# Define training arguments. Adjust batch sizes as needed to avoid memory issues.
training_arguments = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="steps",
    do_eval=True,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=4,
    log_level="debug",
    save_strategy="epoch",
    logging_steps=100,
    learning_rate=1e-4,
    fp16=True,
    bf16=False,
    eval_steps=100,
    num_train_epochs=1,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    local_rank=-1,
)

# Initialize the SFTTrainer.
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
)


comet_ml is installed but the Comet API Key is not configured. Please set the `COMET_API_KEY` environment variable to enable Comet logging. Check out the documentation for other ways of configuring it: https://www.comet.com/docs/v2/guides/experiment-management/configure-sdk/#set-the-api-key


Converting train dataset to ChatML:   0%|          | 0/96 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/96 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/96 [00:00<?, ? examples/s]

Converting eval dataset to ChatML:   0%|          | 0/24 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/24 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/24 [00:00<?, ? examples/s]

Using auto half precision backend
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [6]:

trainer.train()

# Save the fine-tuned model ("TaxSense") to the specified output directory.
model.save_pretrained(output_dir, safe_serialization=True)
print(f"TaxSense model saved to: {output_dir}")


Currently training with a batch size of: 4
The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: prompt, Context2, Context1, ID, Score1, target, Answer, Source, Question, Score2. If prompt, Context2, Context1, ID, Score1, target, Answer, Source, Question, Score2 are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 96
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 6
  Number of trainable parameters = 289,837,056


Step,Training Loss,Validation Loss


Saving model checkpoint to ./model_directory/models--zainnobody--TaxSense/checkpoint-6
loading configuration file ./model_directory/models--deepseek-ai--DeepSeek-V2-Lite-Chat/snapshots/85864749cd611b4353ce1decdb286193298f64c7/config.json
Model config DeepseekV2Config {
  "architectures": [
    "DeepseekV2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_deepseek.DeepseekV2Config",
    "AutoModel": "modeling_deepseek.DeepseekV2Model",
    "AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
  },
  "aux_loss_alpha": 0.001,
  "bos_token_id": 100000,
  "eos_token_id": 100001,
  "ep_size": 1,
  "first_k_dense_replace": 1,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 10944,
  "kv_lora_rank": 512,
  "max_position_embeddings": 163840,
  "model_type": "deepseek_v2",
  "moe_intermediate_size": 1408,
  "moe_layer_freq": 1,
  "n_group": 1,
  "n_routed_experts":

TaxSense model saved to: ./model_directory/models--zainnobody--TaxSense/
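
The log above reports 6 optimization steps and an empty `Step, Training Loss, Validation Loss` table. The sketch below reproduces that arithmetic from the values in `training_arguments` and the log. It assumes a single training process (consistent with the reported total batch size of 16), and the helper name `optimizer_steps` is ours, not part of `transformers`.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_processes=1):
    """Estimate the number of optimizer updates for a run (hypothetical helper)."""
    # Mini-batches that one process sees per epoch.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_processes))
    # One optimizer update happens every `grad_accum` mini-batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs

# Values taken from the training arguments and the training log.
total_steps = optimizer_steps(num_examples=96, per_device_batch=4, grad_accum=4)
effective_batch = 4 * 4  # per_device_train_batch_size * gradient_accumulation_steps

logging_steps = 100
eval_steps = 100
print(f"effective batch size: {effective_batch}, optimizer steps: {total_steps}")
print(f"logging interval reached: {total_steps >= logging_steps}")
print(f"evaluation interval reached: {total_steps >= eval_steps}")
```

Since the run ends after 6 steps, neither `logging_steps=100` nor `eval_steps=100` is ever reached, which would explain the empty loss table. For a dataset of this size, values such as 1 or 2 would likely surface both losses during training.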


The following code copies the configuration and tokenizer files from the base model directory into the new model directory. 
Keeping these files next to the fine-tuned weights ensures that the correct tokenizer configuration 
is used when the model is loaded for inference. Files that do not exist in the base model directory are skipped.

In [5]:
model_files = [
    "config.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "pytorch_model.bin",
    "adapter_config.json",
    "adapter_model.bin",
]

# Copy each file if it exists
for file_name in model_files:
    src_path = os.path.join(base_model_path, file_name)
    dest_path = os.path.join(output_dir, file_name)
    
    if os.path.isfile(src_path):
        shutil.copy(src_path, dest_path)
        print(f"Copied: {file_name} -> {output_dir}")
    else:
        print(f"Skipped: {file_name} (File not found)")

print("Model file transfer completed.")

Copied: config.json -> ./model_directory/models--zainnobody--TaxSense/
Skipped: special_tokens_map.json (File not found)
Copied: tokenizer.json -> ./model_directory/models--zainnobody--TaxSense/
Copied: tokenizer_config.json -> ./model_directory/models--zainnobody--TaxSense/
Skipped: pytorch_model.bin (File not found)
Skipped: adapter_config.json (File not found)
Skipped: adapter_model.bin (File not found)
Model file transfer completed.


In [9]:
from pathlib import Path
import os

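def human_readable_size(size, decimal_places=1):
    """Format a byte count using binary (1024-based) units."""
    for unit in ['B', 'KB', 'MB', 'GB']:
        if size < 1024.0:
            return f"{size:.{decimal_places}f} {unit}"
        size /= 1024.0
    # After four divisions by 1024 the value is expressed in TB.
    return f"{size:.{decimal_places}f} TB"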

directory = Path(base_model_path)
if directory.exists() and directory.is_dir():
    for file in directory.iterdir():
        size = human_readable_size(file.stat().st_size)
        print(f"{size} {file.name}")
else:
    print("Directory not found")

5.2 GB model-00004-of-000004.safetensors
1.2 KB tokenizer_config.json
10.1 KB configuration_deepseek.py
468.7 KB model.safetensors.index.json
1.5 KB config.json
8.0 GB model-00002-of-000004.safetensors
181.0 B generation_config.json
8.0 GB model-00001-of-000004.safetensors
4.4 MB tokenizer.json
76.8 KB modeling_deepseek.py
8.0 GB model-00003-of-000004.safetensors


In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def load_model_and_tokenizer(model_path, device="cuda:0"):
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"
    return model, tokenizer


In [None]:
model, tokenizer = load_model_and_tokenizer(output_dir)

Loading the model from `output_dir` failed with the following error:

```
OSError: ./model_directory/models--zainnobody--TaxSense/ does not appear to have a file named modeling_deepseek.py. Checkout 'https://huggingface.co/./model_directory/models--zainnobody--TaxSense//tree/None' for available files.
```

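To resolve this, we first confirm that `modeling_deepseek.py` exists in the base model directory and then copy it into `output_dir`: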

In [8]:
from pathlib import Path
import os

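def human_readable_size(size, decimal_places=1):
    """Format a byte count using binary (1024-based) units."""
    for unit in ['B', 'KB', 'MB', 'GB']:
        if size < 1024.0:
            return f"{size:.{decimal_places}f} {unit}"
        size /= 1024.0
    # After four divisions by 1024 the value is expressed in TB.
    return f"{size:.{decimal_places}f} TB"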

directory = Path(base_model_path)
if directory.exists() and directory.is_dir():
    for file in directory.iterdir():
        size = human_readable_size(file.stat().st_size)
        print(f"{size} {file.name}")
else:
    print("Directory not found")


5.2 GB model-00004-of-000004.safetensors
1.2 KB tokenizer_config.json
10.1 KB configuration_deepseek.py
468.7 KB model.safetensors.index.json
1.5 KB config.json
8.0 GB model-00002-of-000004.safetensors
181.0 B generation_config.json
8.0 GB model-00001-of-000004.safetensors
4.4 MB tokenizer.json
76.8 KB modeling_deepseek.py
8.0 GB model-00003-of-000004.safetensors


In [10]:
deepseek_model_file = "modeling_deepseek.py"
src_path = os.path.join(base_model_path, deepseek_model_file)
dest_path = os.path.join(output_dir, deepseek_model_file)

if os.path.exists(src_path):
    shutil.copy2(src_path, dest_path)  # copy2 preserves metadata
    print(f"Copied {deepseek_model_file} to {output_dir}")
else:
    print(f"File {deepseek_model_file} not found in {base_model_path}, check if it exists.")

Copied modeling_deepseek.py to ./model_directory/models--zainnobody--TaxSense/
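
The two copy steps above (a fixed file list, then `modeling_deepseek.py` separately) could be folded into one pattern-based helper so that custom model code is not missed. This is a minimal sketch using only the standard library. The helper name `copy_support_files` and the pattern list are our assumptions, and the demo runs on empty placeholder files in temporary directories rather than on the real model directories.

```python
import fnmatch
import shutil
import tempfile
from pathlib import Path

# Hypothetical pattern list: config, tokenizer files and custom model code (*.py),
# but no weight shards and no weight index.
SUPPORT_PATTERNS = ("config.json", "special_tokens_map.json", "tokenizer*", "*.py")

def copy_support_files(src_dir, dest_dir, patterns=SUPPORT_PATTERNS):
    """Copy every file in src_dir whose name matches one of the patterns."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src_dir.iterdir()):
        if path.is_file() and any(fnmatch.fnmatch(path.name, p) for p in patterns):
            shutil.copy2(path, dest_dir / path.name)  # copy2 preserves metadata
            copied.append(path.name)
    return copied

# Demo on placeholder files named after entries in the directory listing above.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for name in ["config.json", "tokenizer.json", "tokenizer_config.json",
                 "modeling_deepseek.py", "configuration_deepseek.py",
                 "model.safetensors.index.json",
                 "model-00001-of-000004.safetensors"]:
        (Path(src) / name).touch()
    demo_copied = copy_support_files(src, dst)
    print(demo_copied)
```

In the notebook this would be called as `copy_support_files(base_model_path, output_dir)`. Matching on `*.py` is what would have picked up `modeling_deepseek.py` in the first pass.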


In [None]:
output_dir = "./model_directory/models--zainnobody--TaxSense/"

model, tokenizer = load_model_and_tokenizer(output_dir)

print("Model loaded.")

The output of the previous cell was cleared because it was very long.

The logs show that while loading the checkpoint shards, several weights weren't used during model initialization:

```
Loading checkpoint shards: 100%
 3/3 [00:47<00:00, 13.32s/it]
Some weights of the model checkpoint at ./model_directory/models--zainnobody--TaxSense/ were not used when initializing DeepseekV2ForCausalLM: {'model.layers.9.mlp.experts.60.up_proj.lora_A.default.weight', 'model.layers.13.mlp.experts.7.down_proj.lora_B.default.weight', 'model.layers.22.mlp.experts.44.up_proj.base_layer.weight.quant_map', 'model.layers.1.mlp.experts.48.gate_proj.base_layer.weight.quant_map',...
```

This happens because fine-tuning with LoRA (via PEFT) injects additional parameters such as `lora_A.default.weight` and `lora_B.default.weight` into the model. A plain `AutoModelForCausalLM.from_pretrained(finetuned_dir)` call does not load these extra LoRA weights, which results in the warning about unused or missing parameters. Loading the base model first and then attaching the adapter with `PeftModel.from_pretrained` avoids this.
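
To make the key mismatch concrete, here is a small illustrative sketch using plain Python sets. The layer and expert indices are borrowed from the warning above, but the key sets are deliberately tiny and hand-written. The `base_layer.weight` key in particular is our assumption about the checkpoint layout, inferred from the `base_layer.weight.quant_map` entries in the log.

```python
# Keys a plain (non-PEFT) model would expect for one expert projection.
expected_keys = {
    "model.layers.9.mlp.experts.60.up_proj.weight",
}

# Keys in a checkpoint saved after LoRA injection: the original layer is
# wrapped as `base_layer`, and `lora_A` / `lora_B` sit next to it.
checkpoint_keys = {
    "model.layers.9.mlp.experts.60.up_proj.base_layer.weight",
    "model.layers.9.mlp.experts.60.up_proj.lora_A.default.weight",
    "model.layers.9.mlp.experts.60.up_proj.lora_B.default.weight",
}

# Present in the checkpoint, but the plain model has no slot for them.
unexpected = sorted(checkpoint_keys - expected_keys)
# Expected by the plain model, but absent from the checkpoint.
missing = sorted(expected_keys - checkpoint_keys)

print("not used when initializing:", unexpected)
print("missing from checkpoint:", missing)
```

None of the checkpoint names line up with what the plain model expects, so every adapter weight ends up in the "not used" list.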

In [1]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "./model_directory/models--deepseek-ai--DeepSeek-V2-Lite-Chat/snapshots/85864749cd611b4353ce1decdb286193298f64c7"
output_dir = "./model_directory/models--zainnobody--TaxSense/"
lora_output_dir = os.path.join(output_dir, "checkpoint-6")

max_memory = {
    0: "20GiB",
    1: "20GiB",
}

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    trust_remote_code=True,
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.float16,
)

model = PeftModel.from_pretrained(
    base_model,
    lora_output_dir,
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(
    base_model_path,
    trust_remote_code=True
)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

print("Successfully loaded the model + LoRA adapter across multiple GPUs.")

2025-03-01 21:24:42.809166: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Successfully loaded the model + LoRA adapter across multiple GPUs.


A quick manual check on a few samples:

In [3]:
import pandas as pd
import torch
from transformers import GenerationConfig

DATA_DIR = "data"
q_and_a_path = f"{DATA_DIR}/tax-questions-answers-with-context.csv"
df = pd.read_csv(q_and_a_path)

def construct_prompt(question, context1, context2):
    return f"""You are a helpful tax advisor and legal expert. Use the provided context to answer the user's query in a clear and concise manner.

User Query: {question}

Related Context:
{context1}
{context2}

Note: The above information is extracted from relevant forms or online sources. Use it to formulate your response.

Response:
"""

def generate_reply(prompt, max_new_tokens=256):
    gen_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

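    # torch.autocast replaces the deprecated torch.cuda.amp.autocast.
    with torch.no_grad(), torch.autocast(device_type="cuda"):
        output_tokens = model.generate(
            **inputs,
            generation_config=gen_config,
            # With the KV cache disabled, every step re-processes the whole
            # sequence, which slows generation down considerably.
            use_cache=False
        )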

    return tokenizer.decode(
        output_tokens[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )

indices_to_test = [0, 1, 2]

for idx in indices_to_test:
    question = df.loc[idx, "Question"]
    expected_answer = df.loc[idx, "Answer"]
    context1 = df.loc[idx, "Context1"]
    context2 = df.loc[idx, "Context2"]
    
    test_prompt = construct_prompt(question, context1, context2)
    model_reply = generate_reply(test_prompt)

    print("=" * 60)
    print(f"Sample Index: {idx}")
    print("-" * 60)
    print("QUESTION:")
    print(question)
    print("\nCONTEXT USED:")
    print("Context1:", context1)
    print("Context2:", context2)
    print("\nEXPECTED REPLY:")
    print(expected_answer)
    print("\nMODEL REPLY:")
    print(model_reply)
    print("=" * 60, "\n")


Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.


Sample Index: 0
------------------------------------------------------------
QUESTION:
How can I reduce my tax bill?

CONTEXT USED:
Context1: System at IRS.gov/SAMS.
For more information, go to IRS.gov/Advocate.
How To Make a Contribution To
Reduce Debt Held by the Public
There are two ways to make a contribution to reduce the
debt held by the public.
• At Pay.gov, contribute online by credit card, debit card,
PayPal, checking account, or savings account.
• Write a check payable to “Bureau of the Fiscal Service.”
In the memo section, notate that it is a gift to reduce the
debt held by the public.
Mail the check to:
Attn: Dept G
Context2: WV 26106-2188. Or you can enclose
the check with your income tax return
when you file. In the memo section of
the check, make a note that it is a gift to
reduce the debt held by the public. Don’t
add your gift to any tax you may owe.
See the instructions for line 37 for de-
tails on how to pay any tax you owe. For
information on how to make this type o

Setting `pad_token_id` to `eos_token_id`:100001 for open-end generation.


Sample Index: 1
------------------------------------------------------------
QUESTION:
What deductions am I eligible for?

CONTEXT USED:
Context1: Standard Deduction (Group I Only)
If you do not itemize your deductions, you can take the 2025
standard deduction listed below for your filing status.
Filing Status
Standard
Deduction
Married filing jointly or
Qualifying surviving spouse
. . . . . . . . . . . . . .
$30,000\*
Head of household . . . . . . . . . . . . . . . . . . . .
$22,500\*
Single or Married filing
separately
. . . . . . . . . . . . . . . . . . . . . . . . .
$15,000\*
Context2: 33
Standard Deduction Worksheet for Dependents—Line 12
Keep for Your Records
Use this worksheet only if someone can claim you, or your spouse if filing jointly, as a dependent.
1.
Check if:
You were born before January 2, 1960.
You are blind.
Spouse was born before January 2, 1960.
Spouse is blind.
Total number of boxes
checked
. . . . . . . . . . . . . . . . . .
1.
2.
Is your earned income\* more th

**Challenges Encountered:**

1. **Time Efficiency:** Inference with the fine-tuned model is slow: it took 15 minutes to answer just 3 queries, whereas other models complete the same task within seconds.
   
2. **Resource Intensive:** The model needs more GPU resources for inference than some setups can provide.

3. **Data Requirements:** Effective fine-tuning requires a substantial volume of high-quality data (this run used only 96 training examples), which implies a long data collection period.

4. **Model Complexity:** Our LoRA adapters added 289,837,056 trainable parameters to the model. DeepSeek V2 Lite, however, already has 16 billion total parameters, of which 2.4 billion are active, and its efficiency and speed may not be fully leveraged by our additions.

Reducing `max_new_tokens` from 256 to 128 improved the response time only to 3 minutes and 52 seconds, and at the cost of shorter responses.
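
One likely contributor to these timings is the `use_cache=False` argument in `generate_reply`: without a key/value cache, each decoding step runs the model over the whole sequence again. The rough count below illustrates the effect. It only counts token positions passed through the model, the prompt length of 300 tokens is a made-up value, and whether the cache can be enabled with this model's custom code was not verified here.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model while generating."""
    if use_cache:
        # The prompt is processed once; later steps feed only the newest token.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step t re-processes the prompt plus the t tokens so far.
    return sum(prompt_len + t for t in range(new_tokens))

prompt_len = 300  # hypothetical prompt length in tokens
for new_tokens in (256, 128):
    cached = positions_processed(prompt_len, new_tokens, use_cache=True)
    uncached = positions_processed(prompt_len, new_tokens, use_cache=False)
    print(f"max_new_tokens={new_tokens}: cached={cached}, uncached={uncached}, "
          f"ratio={uncached / cached:.0f}x")
```

Under these assumptions the uncached count grows roughly quadratically with the number of generated tokens, which is in line with the observation that halving `max_new_tokens` shortens the run by more than half.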


The primary focus of this project is on the MLOps aspects of AI, emphasizing efficient execution and processing of user responses rather than optimizing a single model. With that in mind, we will include the "TaxSense" model as one of the options available to users. The model will also be available on Hugging Face for further testing.

### Hugging Face upload

In [8]:
from huggingface_hub import HfApi, create_repo, upload_folder

api = HfApi()
create_repo(repo_id="zainnobody/TaxSense", private=False, exist_ok=True)
upload_folder(
    repo_id="zainnobody/TaxSense",
    folder_path="./model_directory/models--zainnobody--TaxSense/",
    path_in_repo=".",
    commit_message="Initial commit"
)

rng_state.pth:   0%|          | 0.00/14.2k [00:00<?, ?B/s]

optimizer.pt:   0%|          | 0.00/601M [00:00<?, ?B/s]

scaler.pt:   0%|          | 0.00/988 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/1.16G [00:00<?, ?B/s]

Upload 20 LFS files:   0%|          | 0/20 [00:00<?, ?it/s]

scheduler.pt:   0%|          | 0.00/1.06k [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.62k [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/839M [00:00<?, ?B/s]

events.out.tfevents.1740856547.b3c7d853-4c5a-4a0a-ab54-087dc02516fb.176706.0:   0%|          | 0.00/7.39k [00:…

events.out.tfevents.1740857662.b3c7d853-4c5a-4a0a-ab54-087dc02516fb.194220.0:   0%|          | 0.00/6.98k [00:…

events.out.tfevents.1740857841.b3c7d853-4c5a-4a0a-ab54-087dc02516fb.196038.0:   0%|          | 0.00/6.98k [00:…

events.out.tfevents.1740857960.b3c7d853-4c5a-4a0a-ab54-087dc02516fb.197340.0:   0%|          | 0.00/6.98k [00:…

events.out.tfevents.1740858035.b3c7d853-4c5a-4a0a-ab54-087dc02516fb.198196.0:   0%|          | 0.00/7.39k [00:…

events.out.tfevents.1740858797.b3c7d853-4c5a-4a0a-ab54-087dc02516fb.205802.0:   0%|          | 0.00/6.98k [00:…

events.out.tfevents.1740858884.b3c7d853-4c5a-4a0a-ab54-087dc02516fb.206757.0:   0%|          | 0.00/6.98k [00:…

events.out.tfevents.1740858974.b3c7d853-4c5a-4a0a-ab54-087dc02516fb.207760.0:   0%|          | 0.00/6.98k [00:…

events.out.tfevents.1740859189.b3c7d853-4c5a-4a0a-ab54-087dc02516fb.209938.0:   0%|          | 0.00/88.0 [00:0…

events.out.tfevents.1740859411.b3c7d853-4c5a-4a0a-ab54-087dc02516fb.212192.0:   0%|          | 0.00/6.98k [00:…

events.out.tfevents.1740859854.b3c7d853-4c5a-4a0a-ab54-087dc02516fb.216891.0:   0%|          | 0.00/7.39k [00:…

CommitInfo(commit_url='https://huggingface.co/zainnobody/TaxSense/commit/60fdc1255746c082f57b93e6615cf047e2adcd53', commit_message='Initial commit', commit_description='', oid='60fdc1255746c082f57b93e6615cf047e2adcd53', pr_url=None, repo_url=RepoUrl('https://huggingface.co/zainnobody/TaxSense', endpoint='https://huggingface.co', repo_type='model', repo_id='zainnobody/TaxSense'), pr_revision=None, pr_num=None)
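
The upload log shows that training-state files (`optimizer.pt`, `scheduler.pt`, `scaler.pt`, `rng_state.pth`) and TensorBoard event files were pushed along with the weights. These are not needed for inference. The sketch below previews locally which files a set of glob patterns would exclude. The relative paths are hypothetical (only the file names come from the log), and the pattern list is our suggestion.

```python
import fnmatch

# Hypothetical relative paths, modelled on file names in the upload log.
local_files = [
    "config.json",
    "model-00001-of-00003.safetensors",
    "checkpoint-6/adapter_model.safetensors",
    "checkpoint-6/optimizer.pt",
    "checkpoint-6/scheduler.pt",
    "checkpoint-6/scaler.pt",
    "checkpoint-6/rng_state.pth",
    "runs/events.out.tfevents.1740856547",
]

# Training state and TensorBoard logs; not required to load the model.
ignore_patterns = ["*.pt", "*.pth", "*events.out.tfevents*"]

def files_to_upload(paths, patterns):
    """Return the paths that match none of the ignore patterns."""
    return [p for p in paths
            if not any(fnmatch.fnmatch(p, pattern) for pattern in patterns)]

kept = files_to_upload(local_files, ignore_patterns)
for path in kept:
    print(path)
```

The same list could then be passed to `upload_folder` through its `ignore_patterns` argument. Matching on the client side may differ in detail from this local preview, so it is worth checking the resulting repository contents.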