A4: what does 'linked' in get_links() mean? #159

Closed
dakshaau opened this issue Apr 19, 2016 · 8 comments

Comments

@dakshaau

Sir,

The description of the get_links method in assignment 4 says

Return a SortedSet of computer scientist names that are linked from this
html page. The return set is restricted to those people in the provided
set of names. The returned list should contain no duplicates.

What does that mean exactly? Should I check for the presence of every name anywhere in the HTML file, or only in links containing "/wiki/"? If the latter, should an outlink from a page to its own name be removed from the set of linked names?

@aronwc
Member

aronwc commented Apr 20, 2016

You should extract all href links from this page and filter them to those whose suffix matches an element of the names parameter. See the doctest for an example.
Note that the return type is a SortedSet.

Self-links should be returned by the get_links method, but you should then filter them out in the read_links method. The read_links method will call get_links.
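
A minimal sketch of the extraction and filtering described above, under several assumptions: hrefs are double-quoted, "suffix" is taken to mean the last path segment of the href, and a sorted list stands in for the SortedSet the assignment requires. The name `get_links_sketch` and the sample HTML are made up for illustration; the doctest in the assignment remains the authority.

```python
import re

def get_links_sketch(html, names):
    """Return the sorted, de-duplicated names that end an href in html."""
    found = set()
    # Assumption: every link is written as href="..." with double quotes.
    for href in re.findall(r'href="([^"]*)"', html):
        # Compare only the last path segment, so no prefix such as
        # /wiki/ is assumed.
        suffix = href.rsplit('/', 1)[-1]
        if suffix in names:
            found.add(suffix)
    # Stand-in for SortedSet: a sorted list without duplicates.
    return sorted(found)

# Made-up sample page with a self-link, a duplicate, and an unrelated link.
html = ('<a href="/wiki/Ada_Lovelace">Ada</a> '
        '<a href="/wiki/Charles_Babbage">Babbage</a> '
        '<a href="http://en.wikipedia.org/wiki/Alan_Turing">Turing</a> '
        '<a href="/wiki/Charles_Babbage">Babbage again</a> '
        '<a href="/wiki/London">London</a>')
names = {'Ada_Lovelace', 'Alan_Turing', 'Charles_Babbage'}
links = get_links_sketch(html, names)
print(links)
```

The self-link (Ada_Lovelace) is deliberately kept at this stage, since removing it is left to read_links.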

@dakshaau
Author

I am getting 5035 inlinks and outlinks instead of 5047. The ranks are correct, though the values differ.

I was getting a charmap error while forming a string from the HTML file. To solve this I used "encoding=utf8". Do I have to use a different encoding to get the correct results?
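
For context, on Windows `open()` without an `encoding` argument typically falls back to a legacy code page, which is a common source of "charmap" errors on UTF-8 files. The sketch below shows the explicit-encoding read mentioned above; the helper name `read_html` and the throwaway demo file are hypothetical, and it assumes the HTML files are in fact UTF-8.

```python
import os
import tempfile

def read_html(path):
    """Read a file as text, decoding explicitly as UTF-8."""
    # Passing encoding avoids relying on the platform default code page.
    with open(path, encoding='utf8') as f:
        return f.read()

# Demo with a temporary file containing a non-ASCII character.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'page.html')
    with open(path, 'w', encoding='utf8') as f:
        f.write('<a href="/wiki/Kurt_Gödel">Kurt Gödel</a>')
    text = read_html(path)
```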

@aronwc
Member

aronwc commented Apr 20, 2016

Hmm... can you confirm that read_names returns 509 names? Some of the file names have strange characters, which are perhaps handled differently by different operating systems.
For reference, I've included here the number of outlinks found for each page.
outlinks.txt

@dakshaau
Author

dakshaau commented Apr 20, 2016

read_names is returning 509 names. There seems to be a difference of 1 outlink for most (490) of the names.

I have attached my output for outlinks
myout.txt

@aronwc
Member

aronwc commented Apr 20, 2016

Perhaps you should not assume the /wiki/ prefix.

@dakshaau
Author

Sir,

I tried 2-3 variations for finding the outlinks.

  1. I removed '/wiki/' from the search criteria: 435 names have different numbers
  2. I retained the self names i.e., Ada_Lovelace in Ada_Lovelace: 94 names have different numbers
  3. I kept '/wiki/' and retained self names: 33 names have different outlink length

None of the above versions had a total outlink count near 5047, though.

In the description of read_links(), outlinks['Ada_Lovelace'] has 2 outlinks, but in the outlinks.txt you provided for reference it has 3.

@aronwc
Member

aronwc commented Apr 20, 2016

Here are the three links get_links should return for Ada_Lovelace:

['Ada_Lovelace', 'Alan_Turing', 'Charles_Babbage']

Inside the read_links function, the self link should be removed, leaving Turing & Babbage.
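
The self-link removal step can be sketched as follows. This is only an illustration, not the assignment's read_links: the helper name `remove_self_links` is hypothetical, it takes an already-built dict of get_links results rather than reading files, and sorted lists stand in for SortedSet.

```python
def remove_self_links(raw_links):
    """Map each page to its linked names, excluding the page itself.

    raw_links: dict of page name -> names as returned by get_links.
    """
    outlinks = {}
    for page, linked in raw_links.items():
        # Drop the self-link; everything else is kept, sorted and unique.
        outlinks[page] = sorted(name for name in set(linked) if name != page)
    return outlinks

raw = {'Ada_Lovelace': ['Ada_Lovelace', 'Alan_Turing', 'Charles_Babbage']}
outlinks = remove_self_links(raw)
print(outlinks['Ada_Lovelace'])
```

With the three links listed for Ada_Lovelace, this leaves the two outlinks shown in the read_links description.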

@dakshaau
Author

dakshaau commented Apr 20, 2016

Sir,

I think I found the issue. Your outlinks file contains the name 'Guy_L._Steele,_Jr.', but in my data folder the file is named 'Guy_L._Steele,_Jr', without the trailing '.', because of Windows. The file name is correct in the archive, but when it is extracted the second '.' disappears.

I have 12 fewer links, and since this name is read incorrectly, it is probably what is causing the problem.

What should be done in this case?

EDIT: Forcibly adding '.' to the name 'Guy_L._Steele,_Jr' fixed the issue.
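
One possible way to apply that fix without hard-coding the name is sketched below. It assumes a reference set of expected names is available (for example, taken from the outlinks.txt posted earlier); the helper `restore_trailing_dots` and the variable names are hypothetical.

```python
def restore_trailing_dots(names, expected):
    """Append '.' to names that match an expected name only with the dot."""
    repaired = []
    for name in names:
        # Only touch names that are unknown as-is but known with a '.'.
        if name not in expected and name + '.' in expected:
            repaired.append(name + '.')
        else:
            repaired.append(name)
    return repaired

expected = {'Ada_Lovelace', 'Guy_L._Steele,_Jr.'}
on_disk = ['Ada_Lovelace', 'Guy_L._Steele,_Jr']
fixed = restore_trailing_dots(on_disk, expected)
print(fixed)
```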
