# Downloading French Legal Codes

This notebook demonstrates how to download French legal codes from Légifrance and save them as HTML files. The codes include the Civil Code, the Penal Code, the Commercial Code, and many others.

[Download a law](https://www.legifrance.gouv.fr/search/all?tab_selection=all&searchField=ALL&query=&page=1&init=true)

## Legal Codes Included
1. Code de l'action sociale et des familles
2. Code de l'artisanat
3. Code des assurances
4. Code de l'aviation civile
5. Code du cinéma et de l'image animée
6. Code civil V
7. Code de la commande publique
8. Code de commerce V
9. Code des communes V
10. Code de la consommation
11. Code de la construction et de l'habitation
12. Code de la défense
13. Code de l'éducation
14. Code électoral
15. Code de l'énergie
16. Code de l'environnement
17. Code de l'entrée et du séjour des étrangers et du droit d'asile
18. Code général des collectivités territoriales
19. Code général des impôts
20. Code général de la fonction publique
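
As a minimal sketch of the download-and-save step, the helpers below fetch a page with the standard library and write it to an HTML file named after the code. The function names, the `legal_codes_html` output directory, and the idea of passing one URL per code are assumptions made for illustration; the Légifrance URL of each code is not given in this notebook and would have to be looked up.

```python
import re
import unicodedata
from pathlib import Path
from urllib.request import Request, urlopen


def slugify(name: str) -> str:
    """Turn a code title such as "Code de l'énergie" into a safe file stem."""
    # Strip accents, then replace every run of non-alphanumerics with "_"
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page and return its body as text (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_code_html(name: str, html: str, out_dir: str = "legal_codes_html") -> Path:
    """Write the HTML of one legal code to <out_dir>/<slug>.html."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{slugify(name)}.html"
    target.write_text(html, encoding="utf-8")
    return target
```

Usage would then be along the lines of `save_code_html("Code civil", fetch_html(url))`, where `url` is the page of that code on Légifrance.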


In [6]:
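import yake

def extract_keywords_from_article(text, num_keywords):
    """Return the top `num_keywords` YAKE keywords (up to 3-grams) for `text`."""
    # The legal codes are written in French, so use YAKE's French stopword list
    kw_extractor = yake.KeywordExtractor(lan="fr", n=3, dedupLim=0.9, top=num_keywords)
    keywords = kw_extractor.extract_keywords(text)
    # Each result is treated as a (keyword, score) pair; keep only the keyword
    return [kw[0] for kw in keywords]

def generate_embeddings(text):
    """Placeholder: return a 768-dimensional zero vector instead of a real embedding."""
    return [0.0] * 768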

In [7]:
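# main.py
# generate_embeddings and extract_keywords_from_article are defined in the previous cell.

import os
import logging
from law_pdf_parser import extract_articles_from_pdf
from database_interfaces import ChromaDBInterface, Neo4jInterface
from tqdm import tqdm
from dotenv import load_dotenv

# Load CHROMA_DB_PATH, NEO4J_URI, NEO4J_USER and NEO4J_PASSWORD from a .env file
load_dotenv()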

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def process_article(article, document_name, chroma_db, neo4j_db):
    """Process a single article."""
    try:
        # Generate embeddings
        article_embedding = generate_embeddings(article['content'])
        
        # Extract keywords: request 5% of the character count, with a minimum of one
        num_keywords = max(int(len(article['content']) * 0.05), 1)
        keywords = extract_keywords_from_article(article['content'], num_keywords)
        
        # Store in ChromaDB
        chroma_db.store_article(document_name, article['number'], article['content'], article_embedding)
        chroma_db.store_keywords(document_name, article['number'], keywords)
        
        # Store in Neo4j
        neo4j_db.store_article(document_name, article['number'], article['content'])
        
        logger.info(f"Processed Article {article['number']} from {document_name}")
    except Exception as e:
        logger.error(f"Error processing Article {article['number']} from {document_name}: {str(e)}")

def main():
    chroma_db = ChromaDBInterface(os.environ.get('CHROMA_DB_PATH'))
    neo4j_db = Neo4jInterface(os.environ.get('NEO4J_URI'), os.environ.get('NEO4J_USER'), os.environ.get('NEO4J_PASSWORD'))

    try:
        # Process each PDF in the input directory
        input_dir = "database/laws/"
        for filename in tqdm(os.listdir(input_dir), desc="Processing files"):
            if not filename.endswith('.pdf'):
                continue

            pdf_path = os.path.join(input_dir, filename)
            document_name = os.path.splitext(filename)[0]
            logger.info(f"Processing {filename}")

            try:
                articles = extract_articles_from_pdf(pdf_path)
                for article in articles:
                    process_article(article, document_name, chroma_db, neo4j_db)
                logger.info(f"Completed processing {filename}")
            except Exception as e:
                logger.error(f"Error processing {filename}: {str(e)}")
    finally:
        # Close database connections even if processing is interrupted
        chroma_db.close()
        neo4j_db.close()

if __name__ == "__main__":
    main()

ValueError: You are using a deprecated configuration of Chroma.

If you do not have data you wish to migrate, you only need to change how you construct
your Chroma client. Please see the "New Clients" section of https://docs.trychroma.com/migration.
________________________________________________________________________________________________

If you do have data you wish to migrate, we have a migration tool you can use in order to
migrate your data to the new Chroma architecture.
Please `pip install chroma-migrate` and run `chroma-migrate` to migrate your data and then
change how you construct your Chroma client.

See https://docs.trychroma.com/migration for more information or join our discord at https://discord.gg/8g5FESbj for help!
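
The error above indicates that the Chroma client is still built with the legacy configuration. Assuming `ChromaDBInterface` (not shown in this notebook) creates its client from `CHROMA_DB_PATH`, a minimal sketch of the newer construction might look like the following. `make_chroma_client`, `get_article_collection`, the `"articles"` collection name, and the `./chroma_db` default are hypothetical; `chromadb.PersistentClient` and `get_or_create_collection` are existing chromadb calls.

```python
import os


def make_chroma_client(path=None):
    """Build a persistent Chroma client the post-migration way (sketch)."""
    # Imported lazily so this module can be loaded without chromadb installed
    import chromadb

    # CHROMA_DB_PATH mirrors the variable used in main(); "./chroma_db" is an assumed default
    db_path = path or os.environ.get("CHROMA_DB_PATH", "./chroma_db")
    # Replaces the deprecated chromadb.Client(Settings(...)) style of construction
    return chromadb.PersistentClient(path=db_path)


def get_article_collection(client, name="articles"):
    """Return the collection used to store articles, creating it if needed."""
    return client.get_or_create_collection(name=name)
```

If existing data has to be kept, the `chroma-migrate` tool mentioned in the message would need to be run before switching to this client.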

# Future prompts

---
```plain_text
For context, here are all of the user's previous queries:
[conversation, only user inputs]

Here is their last query:
[last user input]

Generate a refined version of the user's last query based on all the previous queries. Your refined query will be used to prompt a retrieval chain, so make it clear and self-contained.
Make sure the query contains all the information required to properly query a law database.
First, break down your reasoning step by step in detail.
At the end of the breakdown, give your final query using the following format (capitalized words are placeholders):
"<query>
FINAL_ENHANCED_QUERY
</query>"
Please respect the given format.
```
---
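
A possible way to fill in this prompt and read back the `<query>` block is sketched below. `PROMPT_TEMPLATE` is a condensed stand-in for the prompt above, and `build_prompt`, `extract_query`, and the placeholder names are assumptions for illustration; the call to the language model itself is left out.

```python
import re

# Condensed stand-in for the prompt above; the placeholder names are assumptions
PROMPT_TEMPLATE = (
    "For context, here are all of the user's previous queries:\n{history}\n\n"
    "Here is their last query:\n{last_query}\n\n"
    "Refine the last query and answer in the format:\n<query>\nFINAL_ENHANCED_QUERY\n</query>"
)


def build_prompt(user_inputs):
    """Fill the template from a list of user inputs (oldest first)."""
    if not user_inputs:
        raise ValueError("at least one user input is required")
    *history, last_query = user_inputs
    return PROMPT_TEMPLATE.format(history="\n".join(history), last_query=last_query)


def extract_query(model_output):
    """Return the text inside the last <query>...</query> pair, or None."""
    matches = re.findall(r"<query>\s*(.*?)\s*</query>", model_output, flags=re.DOTALL)
    return matches[-1] if matches else None
```

Taking the last match is a deliberate choice: the step-by-step breakdown may mention the tags before the final answer is given.
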

-> Feed the resulting {query} into the code below:

---

# Future code

```python
pre_processed_query = query_pre_processing(query)
enhanced_query = enhance_query_keywords(pre_processed_query)
```
---
```python
def enhance_query_keywords(query):
    logger.info(f"Enhancing query with keywords: {query}")
    try:
        # Extract keywords from the query (at least 8; the count must be an integer)
        num_keywords = max(int(0.3 * len(query.split())), 8)
        query_keywords = extract_keywords(query, num_of_keywords=num_keywords)
        # Generate embeddings for each query keyword
        embeddings = {keyword: get_embeddings_openai(keyword) for keyword in query_keywords}
        enhanced_keywords = {}
        for keyword, embedding in embeddings.items():
            # Find similar keywords in the collection using the embedding
            results = collection_embeddings.query(query_embeddings=[embedding], n_results=5)
            similar_keywords = results['documents'][0]  # Assume results are structured with documents containing the keywords
            # Add similar keywords to the set
            enhanced_keywords[keyword] = set(similar_keywords)
        # Replace each keyword in the original query with the augmented keywords
        enhanced_query = []
        for word in query.split():
            if word in enhanced_keywords:
                # Replace the word with the original and similar keywords
                augmented_keywords = ' '.join(enhanced_keywords[word])
                enhanced_query.append(f"{word} {augmented_keywords}")
            else:
                enhanced_query.append(word)
        enhanced_query_str = ' '.join(enhanced_query)
        logger.info(f"Enhanced query: {enhanced_query_str}")
        return enhanced_query_str
    except Exception as e:
        logger.error(f"Error enhancing query keywords: {e}")
        return query
```
---
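
One limitation of a word-by-word replacement loop is that a keyword made of several words (the YAKE extractor earlier uses `n=3`) never equals a single token from `query.split()`. A possible alternative, sketched under the assumption that the similar keywords have already been retrieved into a dict, matches whole phrases in a single regex pass. `expand_query` and its `similar` argument are hypothetical names.

```python
import re


def expand_query(query, similar):
    """Append related keywords after each keyword phrase found in the query.

    `similar` maps a keyword (possibly several words) to an iterable of
    related keywords, e.g. the results of a vector-store lookup.
    """
    # Sort the related keywords for a deterministic order and drop the keyword itself
    related = {k.lower(): sorted(set(v) - {k}) for k, v in similar.items()}
    if not related:
        return query

    # Longest phrases first so that "droit d'asile" is preferred over "droit"
    alternatives = "|".join(re.escape(k) for k in sorted(related, key=len, reverse=True))
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", flags=re.IGNORECASE)

    def add_related(match):
        extras = related.get(match.group(0).lower(), [])
        return " ".join([match.group(0), *extras])

    # A single pass means the appended keywords are never expanded again
    return pattern.sub(add_related, query)
```
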
```python
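import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# WHITE, BLUE, ITALIC and RESET are assumed to be ANSI color constants defined elsewhere
def query_pre_processing(query):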
    logger.info(f"{WHITE}Pre-processing query: {BLUE}{ITALIC}{query}{RESET}")
    try:
        # Tokenization
        tokens = word_tokenize(query.lower())
        # Stopword Removal
        stop_words = set(stopwords.words('english'))
        tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
        # Stemming
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(word) for word in tokens]
        # Lemmatization
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
        processed_query = ' '.join(tokens)
        logger.info(f"{WHITE}Processed query: {BLUE}{ITALIC}{processed_query}{RESET}")
        return processed_query
    except Exception as e:
        logger.error(f"Error in query pre-processing: {e}")
        return query
```