## EDA automation using LLMs

In [27]:
import pandas as pd
import ollama
import gradio as gr




In [12]:
data=pd.read_csv("titanic dataset.csv")

In [13]:
data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [15]:
data.isnull().sum()


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Filling the missing data: the median for `Age` and the most frequent value for `Embarked`. `Cabin` is left as-is.

In [16]:
data['Age']=data['Age'].fillna(data['Age'].median())
data['Embarked']=data['Embarked'].fillna(data['Embarked'].mode()[0])

In [17]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
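
The two `fillna` calls above can be wrapped in a small reusable helper. The following is only a sketch: the function name `fill_missing` and its parameters are hypothetical, and the toy frame stands in for the Titanic data. Columns that are not listed (such as `Cabin`) are left untouched, mirroring the cell above.

```python
import pandas as pd


def fill_missing(df, median_cols=(), mode_cols=()):
    """Return a copy of df with NaNs filled per column.

    median_cols: numeric columns filled with their median.
    mode_cols:   columns filled with their most frequent value.
    Any column not listed is left unchanged.
    """
    out = df.copy()
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in mode_cols:
        modes = out[col].mode()  # mode() ignores NaN by default
        if not modes.empty:
            out[col] = out[col].fillna(modes[0])
    return out


# Toy data (not the real dataset) to show the behaviour.
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0],
    "Embarked": ["S", None, "S"],
    "Cabin": [None, "C85", None],
})
filled = fill_missing(toy, median_cols=["Age"], mode_cols=["Embarked"])
print(filled.isnull().sum())
```

Returning a copy keeps the original frame available for comparison, which can be handy when checking how much imputation changed the statistics.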

In [None]:
def eda_analysise(data_summary):
    # Send the summary table to a local Mistral model via Ollama and return the reply text.
    prompt = f"Analyze the given summary and give insights:\n\n{data_summary}"
    summary = ollama.chat(model="mistral", messages=[{"role": "user", "content": prompt}])
    return summary["message"]["content"]

In [22]:
data_summary = data.describe()
insights = eda_analysise(data_summary)
print(insights)

 Insights from the summary:

1. The total number of passengers is 891, with no missing values for any feature.

2. The average age of passengers is approximately 29.4 years, and the standard deviation is around 13.0 years. This suggests a wide range of ages among the passengers.

3. The majority (~62%) of passengers did not survive the sinking of the Titanic, as indicated by the mean Survived value of 0.38.

4. Passengers were predominantly from the Third Class (Pclass=3), with an average Pclass of 2.3 and a standard deviation of 0.8. This implies that there were fewer First Class (Pclass=1) and Second Class (Pclass=2) passengers compared to Third Class passengers.

5. The average number of siblings/spouses accompanying passengers was about 0.5, while the average number of parents/children with them was approximately 0.4. This suggests that only a small percentage of passengers had family members traveling with them.

6. The average Fare paid by passengers was around $32.20, but there 
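
Insight 1 above reports no missing values, although `Cabin` still has 687 of them. That is expected: `data.describe()` covers only numeric columns by default, so the prompt carries no information about `Cabin` or the other object columns. One possible remedy, sketched below with a hypothetical helper name `build_summary`, is to send the model both the full statistics and the per-column missing counts as text.

```python
import pandas as pd


def build_summary(df):
    """Build a plain-text summary that includes non-numeric columns
    and the number of missing values per column."""
    stats = df.describe(include="all").to_string()
    missing = df.isnull().sum().to_string()
    return (
        f"Descriptive statistics:\n{stats}\n\n"
        f"Missing values per column:\n{missing}"
    )


# Toy data (not the real dataset) to show what the model would receive.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None],
    "Cabin": [None, "C85", None],
})
summary_text = build_summary(toy)
print(summary_text)
```

The resulting string can be passed to `eda_analysise` in place of `data.describe()`.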

In [23]:
def eda_analysise_model_2(data_summary):
    # Same prompt as before, answered by DeepSeek-R1.
    prompt = f"Analyze the given summary and give insights:\n\n{data_summary}"
    summary = ollama.chat(model="deepseek-r1", messages=[{"role": "user", "content": prompt}])
    return summary["message"]["content"]

In [24]:
def eda_analysise_model_3(data_summary):
    # Same prompt as before, answered by Llama 3.2 Vision.
    prompt = f"Analyze the given summary and give insights:\n\n{data_summary}"
    summary = ollama.chat(model="llama3.2-vision", messages=[{"role": "user", "content": prompt}])
    return summary["message"]["content"]
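
The three functions differ only in the model name, so they could be collapsed into one. The sketch below assumes nothing beyond the `ollama.chat` call already used in this notebook. The name `analyze_with_model` and the `chat_fn` parameter are hypothetical, and `fake_chat` is a stand-in so the example runs without an Ollama server.

```python
def analyze_with_model(data_summary, model="mistral", chat_fn=None):
    """Ask the given model for insights on a summary and return the reply text.

    chat_fn is an optional replacement for ollama.chat (useful for testing).
    """
    if chat_fn is None:
        import ollama  # imported lazily so the helper works without Ollama in tests
        chat_fn = ollama.chat
    prompt = f"Analyze the given summary and give insights:\n\n{data_summary}"
    response = chat_fn(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]


def fake_chat(model, messages):
    # Stand-in for ollama.chat: echoes the model name and the prompt length.
    text = f"[{model}] received {len(messages[0]['content'])} characters"
    return {"message": {"content": text}}


replies = {
    name: analyze_with_model("count 891", model=name, chat_fn=fake_chat)
    for name in ["mistral", "deepseek-r1", "llama3.2-vision"]
}
for name, reply in replies.items():
    print(name, "->", reply)
```

Dropping the `chat_fn=fake_chat` argument would route the calls to the real models.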

In [25]:
data_summary = data.describe()
insights = eda_analysise_model_2(data_summary)
print(insights)

<think>
Okay, so I'm trying to analyze this summary of the Titanic passenger data. Let me go through each section step by step.

First, looking at the PassengerId, all values are present since count is 891 and mean is 446. So no missing data there.

Next up are the Survived stats. The mean is about 0.384, which means only around 38% of passengers survived. That's pretty low. Half of them didn't survive at all because both median and 25% quartiles are zero. It seems like survival was rare.

Then there's Pclass information. Mean is roughly 2.309, so most passengers were in the second class (since classes go from lower to higher numerically). Minimum age in third class is as low as 0.42 years old. The majority are adults because 75% of them are over 28.

Looking at SibSp, which is siblings or spouses, a lot of people had none. Only 36% had one sibling/spouse and nearly all had less than two on average. 

Parch shows that most passengers didn't have any parents/children with them—only abou
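
The DeepSeek-R1 reply starts with a `<think>` section containing the model's reasoning before the actual answer. If only the answer is wanted, the reasoning can be removed after the call. The sketch below assumes the reasoning is closed by a matching `</think>` tag (the output above is cut off before that point, so this is an assumption). The helper name `strip_reasoning` is hypothetical.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_reasoning(text):
    """Remove <think>...</think> blocks and surrounding whitespace.

    Text without a closing </think> tag is returned unchanged (apart from
    trimming), so a truncated reply is never silently emptied.
    """
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\nLet me go through each column.\n</think>\n\nThe survival rate is about 38%."
answer = strip_reasoning(raw)
print(answer)
```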

In [26]:
data_summary = data.describe()
insights = eda_analysise_model_3(data_summary)
print(insights)

**Summary Analysis**

The provided summary appears to be a statistical overview of the Titanic dataset, specifically focusing on the passenger demographics and characteristics.

**Key Insights:**

1. **Survival Rate**: The survival rate is approximately 38% (0.383838). This implies that about two-thirds of passengers did not survive the tragedy.
2. **Age Distribution**: The mean age is around 29 years old, with a standard deviation of over 13 years. This suggests a relatively young passenger population, with many in their mid-to-late 20s and early 30s.
3. **Social Class**: The majority of passengers (over 70%) were from the lower classes (Pclass = 3), while only about 18% belonged to the upper class (Pclass = 1).
4. **Fare Distribution**: Fares ranged from $0 to over $500, with a mean fare around $32. The majority of passengers paid between $7 and $31.
5. **Family Size**: The average number of siblings/spouses (SibSp) is about half, suggesting many families were traveling together. How
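
The models do not agree on the class distribution: the DeepSeek-R1 output reads the `Pclass` mean as "most passengers were in the second class", while this output claims over 70% in third class. Neither figure can be read off `describe()`, which reports only the mean and quartiles of `Pclass`. A quick way to check such claims is to compute the shares directly, sketched here with a hypothetical helper `category_shares` on toy data.

```python
import pandas as pd


def category_shares(df, column):
    """Return the percentage of rows per category, sorted by category value."""
    return (
        df[column]
        .value_counts(normalize=True)  # fractions instead of raw counts
        .sort_index()
        .mul(100)
        .round(1)
    )


# Toy data (not the real dataset): one passenger each in classes 1 and 2, two in class 3.
toy = pd.DataFrame({"Pclass": [1, 2, 3, 3]})
shares = category_shares(toy, "Pclass")
print(shares)
```

Running `category_shares(data, "Pclass")` on the notebook's frame would give the actual percentages to compare against each model's statement.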

In [None]:
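# Note: this redefines the earlier eda_analysise so that it accepts an uploaded file.
def eda_analysise(file):
    # Read the uploaded CSV and summarize its numeric columns as plain text.
    data = pd.read_csv(file)
    data_summary = data.describe().to_string()
    prompt = f"Analyze the given summary and give insights:\n\n{data_summary}"
    summary = ollama.chat(model="mistral", messages=[{"role": "user", "content": prompt}])
    return summary["message"]["content"]

# Minimal web UI: upload a CSV file and get the model's analysis back as text.
demo = gr.Interface(fn=eda_analysise, inputs="file", outputs="text", title="AI-powered EDA using Mistral")
demo.launch()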

* Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.
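
Depending on the installed Gradio version, the value passed for a `"file"` input may be a path string or a temporary-file object with a `.name` attribute. This is an assumption worth checking against the version in use. A small guard, sketched below under the hypothetical name `resolve_upload_path`, normalizes both cases and rejects non-CSV uploads before `pd.read_csv` is called.

```python
import os


def resolve_upload_path(file):
    """Return the filesystem path of an uploaded file.

    Accepts either a path (str / os.PathLike) or an object exposing `.name`.
    Raises ValueError when no path is available or the file is not a .csv.
    """
    if isinstance(file, (str, os.PathLike)):
        path = os.fspath(file)
    else:
        path = getattr(file, "name", None)
    if not path or not str(path).lower().endswith(".csv"):
        raise ValueError("Please upload a .csv file.")
    return path


class FakeUpload:
    # Stand-in for a temp-file object; only the .name attribute matters here.
    def __init__(self, name):
        self.name = name


print(resolve_upload_path("titanic dataset.csv"))
print(resolve_upload_path(FakeUpload("/tmp/upload.CSV")))
```

Inside the Gradio callback, `pd.read_csv(resolve_upload_path(file))` would then replace the direct `pd.read_csv(file)` call.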


