# Problem set 8: Mini-project

We've put some effort into building our collection (see problem set 7 for details and for links to texts and to metadata). Now it's time to learn something about it. You already have lots of excellent ideas for how to apply the tools we've learned about so far. It's also a good time of the semester to review what we have learned and practice applying it in less structured settings.

**You will work by yourself or in a group of up to three people** to complete a short project applying methods from the previous weeks to this collection. You will turn in the completed project as a single notebook (one submission per group) with the following sections:

1. **Question(s).** Describe what you wanted to learn. Suggest several possible answers or hypotheses, and describe in general terms what you might expect to see if each of these answers were true (save specific measurements for the next section). For example, many students want to know the difference between horror and non-horror, or between detective stories and horror fiction, but there are many ways to operationalize this question. You do not need to limit yourself to questions of genre. **Note that your question should be interesting! If the answer is obvious before you begin, or if it's something the importance of which you cannot explain, your grade will suffer (a lot).** (10 points)

1. **Methods.** Describe how you will use computational methods presented so far in this class to answer your question. What do the computational tools do, and how does their output relate to your question? Describe how you will process the collection into a form suitable for a model or algorithm and why you have processed it the way you have. (10 points)

1. **Code.** Carry out your experiments. Code should be correct (no errors) and focused (unneeded code from examples is removed). Use the notebook format effectively: code may be incorporated into multiple sections. (20 points)

1. **Results and discussion.** Use sorted lists, tables, and visual presentations to make your argument. Excellent projects will provide multiple views of results, and follow up on any apparent outliers or strange cases, including through careful reading of the original documents. (40 points)

1. **Reflection.** Describe your experience in this process. What was harder or easier than you expected? What compromises or negotiations did you have to accept to match the collection, the question, and the methods? What would you try next? (10 points)

1. **Responsibility and resources consulted.** Credit any online sources (Stack Overflow, blog posts, documentation) that you found helpful. (0 points, but -10 if missing)
    * **If you worked in a group**, set up a group submission in CMS. Each group member should submit (via CMS) a separate text file in which they describe each member's (including their own) contributions to the project.
    * Most people will turn in *either* a completed notebook for their solo project *or* a responsibility statement. The only people who will submit both files are those who are the designated submitter for their group. Don't worry if CMS warns you about a missing file (unless you're the group submitter).

Note that 10 points will be carried over from problem set 7.

**We will grade this work based on accuracy, thoroughness, creativity, reflectiveness, and quality of presentation.**

**Scope:** this is a *mini*-project, with a short deadline. We are expecting work that is consistent with that timeframe, but that is serious, thoughtful, and rigorous. This problem set will almost certainly require more time and effort than many of the others. **For group work, the expected scope grows linearly with the number of participants.**

# 0. Project team

List here the members of your project team, including yourself.

# 1. Question(s)

# 2. Methods

# 3. Code

In [2]:
# Imports (all of them!)
import numpy as np
import pandas as pd
from pathlib import Path

In [65]:

path = Path('C:\\')/'Users'/'samsu'/'OneDrive'/'Documents'/'GitHub'/'info3350-f20'/'data'/'mini_project'/'Info6350_MiniP_data_v2.csv'
data = pd.read_csv(path)

# generate dummy variables for English, PovFirst, female, horror, detective and adaptation (1 if "True")

data["dHorror"] = data["horror"].astype(int)
data["dDetective"] = data["detective"].astype(int)
data["dAdaptation"] = data["adaptation"].astype(int)
data['dEnglish'] = np.where(data['language'] == 'en', 1, 0)
data['dPovFirst'] = np.where(data['pov'] == 'first', 1, 0)
data['dFemale'] = np.where(data['gender'] == 'female', 1, 0)

# convert all 'gb' country codes to 'uk'
data['country'] = data['country'].replace(['gb'],'uk')

# generate dummy variables for the Romantic, Victorian, Modern and PostModern literary periods (1 if the year falls in the period)
data['dRomanticP'] = np.where((data['year'] >=  1790) & (data['year'] <= 1830), 1, 0)
data['dVictorianP'] = np.where((data['year'] >=  1832) & (data['year'] <= 1901), 1, 0)
data['dModernP'] = np.where((data['year'] >=  1914) & (data['year'] <= 1945), 1, 0)
data['dPostModernP'] = np.where((data['year'] >=  1945), 1, 0)

# generate dummy variables for FranceWar, UKWar, USWar, GermanyWar and War (1 if the work was written while the author's home country was at war)
data['dFranceWar'] = np.where((data['year'] >=  1830) & (data['year'] <= 1848) & (data['country'] == 'fr'), 1, 0)
data['dUKWar'] = np.where((((data['year'] >=  1914) & (data['year'] <= 1918)) | ((data['year'] >=  1939) & (data['year'] <= 1945))) & (data['country'] == 'uk'), 1, 0)
data['dUSWar'] = np.where((((data['year'] >=  1914) & (data['year'] <= 1918)) | ((data['year'] >=  1939) & (data['year'] <= 1945)) | ((data['year'] >=  1861) & (data['year'] <= 1865))) & (data['country'] == 'us'), 1, 0)
data['dGermanyWar'] = np.where((((data['year'] >=  1914) & (data['year'] <= 1918)) | ((data['year'] >=  1939) & (data['year'] <= 1945))) & ((data['country'] == 'de') | (data['country'] == 'cz')), 1, 0)
data['dWar'] = np.where((data['dFranceWar'] ==  1) | (data['dUKWar'] ==  1)| (data['dUSWar'] ==  1)| (data['dGermanyWar'] ==  1), 1, 0)

# summary statistics for key variables
data[['wordcount', 'language', 'age', 'dScience', 'dHorror', 'dDetective', 'dAdaptation', 'dEnglish', 'dPovFirst', 'dFemale', 'dRomanticP', 'dVictorianP', 'dModernP', 'dPostModernP', 'dWar']].describe()  





| | wordcount | age | dScience | dHorror | dDetective | dAdaptation | dEnglish | dPovFirst | dFemale | dRomanticP | dVictorianP | dModernP | dPostModernP | dWar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 133.0 | 136.0 | 136.0 | 136.0 | 136.0 | 136.0 | 136.0 | 136.0 | 136.0 | 136.0 | 136.0 | 136.0 | 136.0 | 136.0 |
| mean | 85418.56391 | 42.205882 | 0.213235 | 0.286765 | 0.25 | 0.389706 | 0.985294 | 0.448529 | 0.522059 | 0.080882 | 0.330882 | 0.220588 | 0.154412 | 0.095588 |
| std | 77294.407893 | 10.918634 | 0.411107 | 0.453923 | 0.434613 | 0.489486 | 0.120818 | 0.499182 | 0.50136 | 0.273662 | 0.47227 | 0.416176 | 0.362679 | 0.295113 |
| min | 2764.0 | 21.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48469.0 | 33.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 61568.0 | 42.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 104438.0 | 49.0 | 0.0 | 1.0 | 0.25 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| max | 667000.0 | 71.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
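
The period dummies above use inclusive year ranges, so it may be worth checking whether any year falls into two periods or into none. The following is only a minimal sketch on hypothetical toy data (the `toy` frame and its five rows are invented for illustration and are not part of the collection); it mirrors the same ranges with `Series.between` and then counts how many period flags each row receives. It also illustrates why `language` is absent from the summary table: `describe()` reports only numeric columns by default.

```python
import pandas as pd

# Hypothetical toy data; the real collection has many more rows and columns.
toy = pd.DataFrame({
    "year": [1818, 1831, 1897, 1945, 1960],
    "language": ["en", "en", "en", "en", "fr"],
})

# Same inclusive year ranges as the cell above, expressed with Series.between
# (both endpoints are included by default).
periods = {
    "dRomanticP": (1790, 1830),
    "dVictorianP": (1832, 1901),
    "dModernP": (1914, 1945),
}
for name, (start, end) in periods.items():
    toy[name] = toy["year"].between(start, end).astype(int)
toy["dPostModernP"] = (toy["year"] >= 1945).astype(int)

# Count how many period flags each row has.
period_cols = list(periods) + ["dPostModernP"]
n_periods = toy[period_cols].sum(axis=1)

# Years flagged for more than one period, and years flagged for none.
overlapping_years = toy.loc[n_periods > 1, "year"].tolist()
unassigned_years = toy.loc[n_periods == 0, "year"].tolist()

# describe() keeps only numeric columns by default, so 'language' is dropped.
numeric_summary_cols = toy.describe().columns.tolist()
```

With these ranges, a work dated 1945 is flagged as both Modern and PostModern, and a work dated 1831 is assigned to no period. Whether that matters depends on the years actually present in the collection, so running the same check on `data` would settle it.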


# 4. Results and discussion

# 5. Reflection

# 6. Responsibility and resources consulted