Step 1: Install spaCy and Download Language Model
Install spaCy, then download a pretrained pipeline. This example uses the small English pipeline, en_core_web_sm:

In [None]:
# Shell commands need a magic prefix inside a notebook cell
%pip install spacy
!python -m spacy download en_core_web_sm
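
To confirm the download worked before moving on, one option is to check whether the pipeline package can be imported. This is a minimal sketch that uses only the standard library. The helper name `is_installed` is made up for illustration, and the check assumes the pipeline was installed as a regular Python package, which is how `spacy download` installs it.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# spaCy pipelines are installed as ordinary Python packages,
# so the same check works for the library and for the model.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports "missing", rerun the install cell and restart the kernel before continuing.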

Step 2: Data Loading and Cleaning
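In practice your dataset would usually come from a CSV file (for example via pd.read_csv). To keep this example self-contained, the cell below builds a small DataFrame in memory, strips surrounding whitespace from each text, and runs it through the spaCy pipeline: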

In [None]:
import pandas as pd
import spacy

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Example dataset (replace with your own dataset)
data = pd.DataFrame({
    'text': [
        "Apple is looking at buying U.K. startup for $1 billion.",
        "I love reading books in the library.",
        "John Doe lives in New York City."
    ]
})

# Function for text cleaning and spaCy processing
def process_text(text):
    # Remove extra whitespace
    text = text.strip()
    # Process text with spaCy
    doc = nlp(text)
    return doc

# Apply processing function to the 'text' column
data['processed'] = data['text'].apply(process_text)
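
If the texts do live in a CSV file, the loading and cleaning step might look like the sketch below. Everything specific here is an assumption made for illustration: the helper name `load_and_clean`, the `id` and `text` column names, and the sample CSV content, which uses an in-memory buffer in place of a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice pass a file path such as "my_data.csv".
SAMPLE_CSV = (
    "id,text\n"
    '1,"  Apple is looking at   buying U.K. startup for $1 billion.  "\n'
    "2,\n"
    '3,"I love reading books in the library."\n'
    '4,"   "\n'
)


def load_and_clean(source, text_column="text"):
    """Read a CSV, drop missing or blank texts, and normalize whitespace."""
    df = pd.read_csv(source)
    # Drop rows where the text is missing entirely
    df = df.dropna(subset=[text_column]).copy()
    # Collapse runs of whitespace into one space, then trim both ends
    df[text_column] = (
        df[text_column]
        .astype(str)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    # Drop rows that are empty after cleaning
    df = df[df[text_column] != ""]
    return df.reset_index(drop=True)


cleaned = load_and_clean(io.StringIO(SAMPLE_CSV))
print(cleaned)
```

The cleaned `text` column can then be passed to `process_text` in the same way as the in-memory example.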


Step 3: Exploratory Data Analysis (EDA)

In [None]:
# Example: Named Entity Recognition (NER) analysis
entities = []
for doc in data['processed']:
    for ent in doc.ents:
        entities.append((ent.text, ent.label_))

# Convert entities to DataFrame
entities_df = pd.DataFrame(entities, columns=['Entity', 'Label'])

# Count and plot entity labels
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 6))
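# hue mirrors y so the palette is applied per label; seaborn 0.13+ deprecates
# passing palette without hue (on older versions, drop hue= and legend=)
sns.countplot(y='Label', hue='Label', data=entities_df, palette='viridis', legend=False)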
plt.title('Named Entity Recognition (NER)')
plt.xlabel('Count')
plt.ylabel('Entity Label')
plt.show()
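
With only a handful of documents, a plain table of counts can be easier to read than a chart. The sketch below tallies labels from (text, label) pairs shaped like the `entities` list built above. The function name `count_labels` and the sample pairs are illustrative assumptions; the pairs are not actual model output.

```python
from collections import Counter


def count_labels(entities):
    """Count entity labels in an iterable of (text, label) pairs."""
    return Counter(label for _text, label in entities).most_common()


# Illustrative pairs only, not actual model output.
sample_entities = [("Apple", "ORG"), ("U.K.", "GPE"), ("New York City", "GPE")]
for label, count in count_labels(sample_entities):
    print(f"{label:<6} {count}")
```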


Step 4: Visualization
Plot the most frequent tokens after filtering out stop words and punctuation:

In [None]:
# Example: Token frequency visualization
from collections import Counter

all_tokens = []
for doc in data['processed']:
    tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
    all_tokens.extend(tokens)

# Count token frequencies
token_freq = Counter(all_tokens)

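# Plot the most common tokens (at most 20)
top_tokens = token_freq.most_common(20)
labels = [token for token, _ in top_tokens]
counts = [count for _, count in top_tokens]
plt.figure(figsize=(10, 6))
# hue mirrors y so the palette is applied per bar; seaborn 0.13+ deprecates
# passing palette without hue (on older versions, drop hue= and legend=)
sns.barplot(x=counts, y=labels, hue=labels, palette='muted', legend=False)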
plt.title('Top 20 Most Common Tokens')
plt.xlabel('Frequency')
plt.ylabel('Token')
plt.show()
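
The loop above counts `token.text` exactly as written, so "Books" and "books" end up as separate entries. One possible refinement is to fold case before counting, as in the sketch below. The helper name `fold_and_count` and the sample strings are assumptions for illustration; in the spaCy loop, passing `token.lemma_.lower()` instead of `token.text` would additionally merge inflected forms.

```python
from collections import Counter


def fold_and_count(tokens, top_n=20):
    """Lowercase each token string, then return the top_n (token, count) pairs."""
    return Counter(tok.lower() for tok in tokens).most_common(top_n)


# Hypothetical token strings standing in for the `all_tokens` list.
sample_tokens = ["Books", "books", "library", "Library", "books", "reading"]
print(fold_and_count(sample_tokens, top_n=2))
```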


Additional Steps (Optional)
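You can extend this example with more advanced spaCy features such as dependency parsing or a customized processing pipeline. Sentiment analysis is also possible, but the stock pretrained pipelines do not include a sentiment component, so it requires a separately trained text classifier or a third-party extension.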
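
As one small example of customizing the pipeline, the sketch below starts from a blank English pipeline and adds only a rule-based sentence splitter. It assumes spaCy v3, where built-in components are added by their registered string name. The variable name `nlp_blank` is chosen here so it does not overwrite the `nlp` object loaded earlier.

```python
import spacy

# Start from a blank English pipeline: tokenizer only, no trained components,
# so no model download is needed for this sketch.
nlp_blank = spacy.blank("en")

# Add spaCy's built-in rule-based sentence splitter by its registered name.
nlp_blank.add_pipe("sentencizer")

doc = nlp_blank("I love reading books in the library. John Doe lives in New York City.")
sentences = [sent.text for sent in doc.sents]

print(nlp_blank.pipe_names)
print(sentences)
```

A pipeline this small runs much faster than a full pretrained one, which can matter when sentence boundaries are all you need. It cannot produce entities or dependency labels, since those require trained components.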

Conclusion
This walkthrough shows how to run a basic analysis of a text dataset with spaCy in Python: processing the texts, extracting named entities, and counting token frequencies. spaCy also supports syntactic analysis and custom pipelines, so you can adapt and extend these examples to suit your own dataset and objectives.