<a href="https://colab.research.google.com/github/iamshivamkumarsharma/Automated-Cow-Data-extraction-NLP-PROJECT-/blob/main/Automated_Cow_Data_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Hello everyone. In this demonstration, I'll walk you through our Automated Cow Data Extraction project, which transcribes a spoken milk report and extracts cow IDs and milk yields from the text.

Step 1: Setting Up the Environment  
First, let's ensure we have the necessary tools installed. We'll start by installing the pydub library for audio processing.

In [55]:
!pip install pydub




Next, we'll mount Google Drive to access our project files.

In [56]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Then we'll install the SpeechRecognition library, which we'll use to transcribe the audio.

In [57]:
!pip install SpeechRecognition
import speech_recognition as sr



Step 2: Preprocessing the Audio

We'll preprocess the audio to enhance transcription accuracy. This includes normalization, silence removal, and noise reduction.


Tokenization:
Tokenization is the process of splitting the transcribed text into individual words. We'll use the word_tokenize function from the nltk.tokenize module for this purpose.






In [58]:
import re   #regular expression
import spacy
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
# List of pre-stored cow IDs
cow_ids = [
    "Rama", "67GB",
    "Maneesha", "F29",
    "XX35", "Mala",
    "Lali", "MX90",
    "Aarti", "babila",
    "Usha", "Ragini",
    "W-704", "Ranu",
    "Mamta", "27 A",
    "46 2 B", "55",
    "Basanti", "297"
]

# Tokenize the transcribed text into words
# (`transcription` is produced by transcribe_audio() in the final cell, so run that cell first)
tokens = word_tokenize(transcription)





[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
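
One limitation of matching single tokens against the ID list is that multi-word IDs such as "27 A" and "46 2 B" are split across several tokens and can never match. A minimal sketch of one way around this is shown below. It assumes whitespace splitting as a stand-in for `word_tokenize`, and `find_ids_in_tokens` is a hypothetical helper that is not part of the notebook. It compares sliding windows of up to three tokens against the list, longest window first.

```python
def find_ids_in_tokens(tokens, cow_ids, max_len=3):
    # Map lowercase spellings back to the spelling stored in the ID list
    known = {cow_id.lower(): cow_id for cow_id in cow_ids}
    found = []
    i = 0
    while i < len(tokens):
        for size in range(max_len, 0, -1):  # try the longest window first
            window = tokens[i:i + size]
            candidate = " ".join(window).lower()
            if len(window) == size and candidate in known:
                found.append(known[candidate])
                i += size  # skip the tokens consumed by this ID
                break
        else:
            i += 1  # no ID starts at this token
    return found

# Whitespace splitting stands in for word_tokenize in this sketch
sample_tokens = "cow 46 2 B and 27 A and babila".split()
print(find_ids_in_tokens(sample_tokens, ["27 A", "46 2 B", "babila", "55"]))
```

Matching is case-insensitive here, which is a design choice of the sketch rather than the notebook's behavior.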


We define a function preprocess_text() to handle text preprocessing tasks.

In [59]:
import re

def preprocess_text(transcription):
    # Convert text to lowercase
    transcription = transcription.lower()

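    # Restore the capital "W" used in IDs such as "W-704", then prefix every
    # standalone number with "-" (this also affects yields: "8 kg" becomes "-8 kg")
    transcription = re.sub(r'\b(w)\b', 'W', transcription)
    transcription = re.sub(r'(\b\d+\b)', r'-\1', transcription)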

    return transcription
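
To make the effect of these substitutions concrete, the sketch below runs the same function on a sample sentence and compares it with a narrower variant. `normalize_w_ids` is a hypothetical alternative, not part of the notebook. It assumes the goal is only to turn a spoken "w 704" into the stored ID "W-704" while leaving other numbers untouched.

```python
import re

def preprocess_text(transcription):
    # Same logic as the notebook cell above
    transcription = transcription.lower()
    transcription = re.sub(r'\b(w)\b', 'W', transcription)
    transcription = re.sub(r'(\b\d+\b)', r'-\1', transcription)
    return transcription

def normalize_w_ids(transcription):
    # Hypothetical variant: join a standalone "w" with the number that follows it
    return re.sub(r'\bw\s+(\d+)\b', r'W-\1', transcription, flags=re.IGNORECASE)

sample = "The yield of w 704 today is 8 kg"
print(preprocess_text(sample))   # the yield of W -704 today is -8 kg
print(normalize_w_ids(sample))   # The yield of W-704 today is 8 kg
```

The original version leaves a space between "W" and "-704" and also rewrites the yield, so the narrower pattern may be a better fit if the result is later matched against the ID list.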




**Named Entity Recognition (NER)** is a technique to identify and classify named entities in the text. In this case, we'll use spaCy for NER to identify cow IDs mentioned in the transcribed text.

In [60]:

!pip install numpy scipy
!pip install noisereduce





In [61]:
import spacy

# Load the spaCy model for NER
nlp = spacy.load("en_core_web_sm")
# `en_core_web_sm` is an English multi-task Convolutional Neural Network (CNN) trained on OntoNotes.
# It assigns context-specific token vectors, part-of-speech (POS) tags, a dependency parse, and named entities.

# Use spaCy to perform NER on the transcribed text
doc = nlp(transcription)


Preprocessing Steps:
Normalize the Audio: Ensure the audio volume is consistent.
Remove Silence: Remove long silences that might confuse the recognizer.
Reduce Noise: Apply noise reduction to clean up the audio.
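
The first two steps can be illustrated without any audio library. The sketch below is a simplified, pure-Python illustration on a list of 16-bit samples, not what pydub does internally. The helper names, the frame size, and the -40 dBFS threshold are assumptions chosen for the example.

```python
import math

FULL_SCALE = 32768  # magnitude of the largest 16-bit PCM sample

def dbfs(amplitude, full_scale=FULL_SCALE):
    # Level in decibels relative to full scale; silence is negative infinity
    if amplitude == 0:
        return float("-inf")
    return 20 * math.log10(abs(amplitude) / full_scale)

def peak_normalize(samples, full_scale=FULL_SCALE):
    # Scale all samples so the loudest one reaches (almost) full scale
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    gain = (full_scale - 1) / peak
    return [int(round(s * gain)) for s in samples]

def drop_silent_frames(samples, frame_size, threshold_db=-40.0):
    # Keep only frames whose RMS level is above the threshold
    kept = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if dbfs(rms) > threshold_db:
            kept.extend(frame)
    return kept

samples = [0, 0, 0, 0, 8000, -8000, 8000, -8000, 10, -10, 10, -10]
normalized = peak_normalize(samples)
print(drop_silent_frames(normalized, frame_size=4))
```

Only the loud middle frame survives in this example; the leading zeros and the very quiet tail both fall below the threshold.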

In [62]:
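# Note: the cells below use the folder "COW PROJECT NLP" (with spaces), so this underscore path does not exist
!ls /content/drive/MyDrive/COW_PROJECT_NLP/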






ls: cannot access '/content/drive/MyDrive/COW_PROJECT_NLP/': No such file or directory


In [63]:
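from pydub import AudioSegment
from pydub.silence import split_on_silence
from scipy.io import wavfile
import numpy as np
import noisereduce as nr

def preprocess_audio(input_file, output_file):
    # Load the audio file
    audio = AudioSegment.from_file(input_file)

    # Normalize audio so the loudest peak sits at 0 dBFS
    audio = audio.apply_gain(-audio.max_dBFS)

    # Remove silence: split wherever the level stays below the threshold, then
    # rejoin the remaining chunks (split_to_mono() would only separate channels)
    silence_threshold = -40  # in dBFS
    # min_silence_len=500 ms is an assumed value; tune it for your recordings
    chunks = split_on_silence(audio, min_silence_len=500, silence_thresh=silence_threshold)
    # Fall back to the full clip if no non-silent chunk was found
    processed_audio = sum(chunks) if chunks else audio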

    # Export the processed audio to a temporary file
    temp_file = "temp.wav"
    processed_audio.export(temp_file, format="wav")

    # Load the temp file using scipy
    rate, data = wavfile.read(temp_file)

    # Apply noise reduction
    reduced_noise = nr.reduce_noise(y=data, sr=rate)

    # Save the processed audio
    wavfile.write(output_file, rate, reduced_noise)

input_audio = "/content/drive/MyDrive/COW PROJECT NLP/audio1.wav"
# Write to a separate file so the original recording is not overwritten
# (the "_preprocessed" file name is an assumption; any new name works)
preprocessed_audio = "/content/drive/MyDrive/COW PROJECT NLP/audio1_preprocessed.wav"

# Preprocess the audio; AudioSegment.from_file() reads the input format directly
preprocess_audio(input_audio, preprocessed_audio)


**IBM Watson Speech to Text:** not used because of an API issue.


**Step 4: Fine-Tuning Google Speech Recognition**

Adjust sensitivity to background noise:

In [64]:
recognizer = sr.Recognizer()
recognizer.energy_threshold = 300
recognizer.dynamic_energy_threshold = True


Transcribing the Audio

We'll convert the audio file to WAV format and transcribe it using the Google Speech Recognition service.

Extracting Data

We'll extract cow IDs from the transcription by combining spaCy NER with token matching against the stored ID list, and we'll extract milk yields with a regular expression.
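
As an illustration of the yield pattern, the sketch below uses a slightly broader regular expression than the one in the program. It assumes yields may also be spoken as decimals or spelled out as "kilograms". `YIELD_PATTERN` and `find_milk_yields` are hypothetical names for this example.

```python
import re

# Assumed pattern: an integer or decimal followed by kg, kgs, kilogram, or kilograms
YIELD_PATTERN = re.compile(r'\b(\d+(?:\.\d+)?)\s*(?:kgs?|kilograms?)\b', re.IGNORECASE)

def find_milk_yields(transcription):
    # Return every yield found in the text as a float, in order of appearance
    return [float(value) for value in YIELD_PATTERN.findall(transcription)]

print(find_milk_yields("the yield of 00704 today is 8 kg"))            # [8.0]
print(find_milk_yields("Rama gave 7.5 kilograms and Usha gave 10Kg"))  # [7.5, 10.0]
```

Numbers that are not followed by a weight unit, such as the ID 00704, are ignored.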

Executing the Program

We'll execute the program by converting the audio file to WAV format, transcribing it, and extracting data.

In [65]:
import os
import re
import spacy
import nltk
import speech_recognition as sr

from pydub import AudioSegment
from nltk.tokenize import word_tokenize

nltk.download('punkt')

def convert_to_wav(mp4_file):
    audio = AudioSegment.from_file(mp4_file)
    wav_file = os.path.splitext(mp4_file)[0] + ".wav"
    audio.export(wav_file, format="wav")
    return wav_file

def transcribe_audio(audio_file):
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_file) as source:
        audio_data = recognizer.record(source)
    try:
        transcription = recognizer.recognize_google(audio_data)
        return transcription
    except sr.UnknownValueError:
        return "Google Speech Recognition could not understand the audio"
    except sr.RequestError as e:
        return f"Could not request results from Google Speech Recognition service; {e}"

def extract_data(transcription):
    cow_ids = [
        "Rama", "67GB", "Maneesha", "F29", "XX35", "Mala", "Lali", "MX90",
        "Aarti", "babila", "Usha", "Ragini", "00704", "Ranu", "Mamta", "27 A",
        "46 2 B", "55", "Basanti", "297"
    ]

    # Tokenize the transcribed text
    tokens = word_tokenize(transcription)

    # Load the spaCy model for NER
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(transcription)

    identified_cow_ids = []
    milk_yields = []

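    # Extract cow IDs: NER catches names tagged as PERSON, token matching catches the rest.
    # Skip IDs that were already recorded so a single mention is not counted twice.
    for ent in doc.ents:
        if ent.label_ == 'PERSON' and ent.text in cow_ids and ent.text not in identified_cow_ids:
            identified_cow_ids.append(ent.text)

    for token in tokens:
        if token in cow_ids and token not in identified_cow_ids:
            identified_cow_ids.append(token)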

    # Extract milk yields using regex
    milk_yield_patterns = re.findall(r'\b(\d+)\s?Kg\b', transcription, re.IGNORECASE)
    milk_yields.extend(milk_yield_patterns)

    # Return the extracted data
    return identified_cow_ids, milk_yields

# Convert mp4 to wav
audio_file_mp4 = "/content/drive/MyDrive/COW PROJECT NLP/audio1.mp4"
audio_file_wav = convert_to_wav(audio_file_mp4)

# Transcribe the audio file
transcription = transcribe_audio(audio_file_wav)
print("Transcription:", transcription)


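# Extract cow IDs and milk yields
# (a separate name avoids overwriting the stored cow_ids list defined earlier)
identified_cow_ids, milk_yields = extract_data(transcription)

print("Identified Cow IDs:", identified_cow_ids)
print("Identified Milk Yields in kg:", milk_yields)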


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Transcription: the yield of 00704 today is 8 kg
Identified Cow IDs: ['00704']
Identified Milk Yields in kg: ['8']
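
The program returns IDs and yields as two separate lists. When a recording mentions several cows, it can help to pair each ID with the yield that follows it. The sketch below is one possible approach, not the notebook's method. `pair_ids_with_yields` and `build_id_pattern` are hypothetical helpers, the ID list is a subset of the stored one, and the sketch assumes every ID is followed by its own yield before the next ID is mentioned.

```python
import re

COW_IDS = ["Rama", "67GB", "00704", "27 A", "46 2 B", "55"]  # subset of the stored list
YIELD = re.compile(r'(\d+(?:\.\d+)?)\s*kg\b', re.IGNORECASE)

def build_id_pattern(cow_ids):
    # Longest IDs first so a multi-word ID wins over a shorter overlapping one
    ordered = sorted(cow_ids, key=len, reverse=True)
    alternatives = "|".join(re.escape(cow_id) for cow_id in ordered)
    # Do not treat a number directly followed by "kg" as an ID (e.g. "55 kg")
    return re.compile(r'(?<!\w)(' + alternatives + r')(?!\w)(?!\s*kg\b)', re.IGNORECASE)

def pair_ids_with_yields(transcription, cow_ids):
    canonical = {cow_id.lower(): cow_id for cow_id in cow_ids}
    id_pattern = build_id_pattern(cow_ids)
    records = []
    for match in id_pattern.finditer(transcription):
        # Take the first yield that appears after this ID
        following = YIELD.search(transcription, match.end())
        if following:
            cow_id = canonical[match.group(1).lower()]
            records.append((cow_id, float(following.group(1))))
    return records

print(pair_ids_with_yields("the yield of 00704 today is 8 kg", COW_IDS))
print(pair_ids_with_yields("46 2 B gave 12 kg and rama gave 55 kg", COW_IDS))
```

Because the ID pattern works on the raw text rather than on tokens, multi-word IDs such as "46 2 B" can be matched as well.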
